Method and apparatus for use in processing signals

ABSTRACT

New dialogue is post-synchronized with guide track dialogue by using signal processing apparatus in which the analog guide track signal x₁(t) undergoes speech parameter measurement processing in a processor (43) to provide a speech parameter vector A(kT). The new dialogue signal x₂(t') is processed to give waveform data, which can be stored on disc (25), and a speech parameter vector B(jT) from a parameter extraction processor (42). The variables k and j are data frame numbers, and T is an analysis interval. Some parameters of the vector B are used in process (48) to classify successive passages of the new dialogue signal into speech and silence, to produce classification data f(jT). The vectors A and B and the classification data are utilized in a time warp processor SBC2 to determine a time-warping function w(kT) giving the values of j in terms of the values of k associated with the corresponding speech features, and thereby indicating the amount of expansion or compression of the waveform data of the new dialogue signal needed to align the time-dependent features of the new dialogue signal with the corresponding features of the guide track signal. Editing instructions are generated in signal editor computer SBC1 from the w(kT) data, feature classification data, pitch data p(jT) and the data stream x₂(nD), so that the editing of x₂(nD) can be carried out by the computer SBC1, in which periods of silence or speech are lengthened or shortened to give the alignment. The edited data x₂(nD) is converted to analog by a converter unit (29), and low pass filtered to provide an audio output signal to be recorded as the synchronized new dialogue.

This application is a continuation of application Ser. No. 586,226, filed Nov. 22, 1983, now abandoned.

This invention relates to a method and apparatus for use in processing signals.

During the production of a film soundtrack, it is often necessary or desirable to replace original dialogue, recorded live at the time of shooting the picture, with dialogue recorded afterwards in the studio, since the original dialogue may be unacceptable because of, for example, a level or type of background noise that cannot be eliminated. The studio recording takes place before the final soundtrack is formed from a mix of dialogue, music and sound effects, and is called post-synchronising or post-synching.

The post-synchronising technique most widely used today is known as the virgin loop system and is operated as follows.

The soundtrack editor breaks down the dialogue scenes to be post-synched into sections of one or two sentences each of up to about 30 seconds in duration. Each section, which consists physically of a length of picture-film and an equal length of magnetic film containing the original dialogue recording, is then made into two endless loops. A third loop (also of the same length) is made up from unrecorded magnetic film. This is the "virgin loop". The loop of magnetic film containing the original dialogue is now called the "guide track".

Each of the actors involved in the scene attends individually at a studio especially designed for post-synching. The picture-film loop is loaded onto a film projector, the guide track is loaded onto a magnetic film reproducer and the virgin loop is loaded onto a magnetic recorder/reproducer. These three machines are adapted to operate in synchronism. The picture-film loop is projected onto a screen in front of the actor. The guide track is replayed to him over headphones, and he endeavours to speak his lines in synchronism with the original dialogue, his efforts being recorded onto the virgin loop. Guide track cues (bleep tones) or chinagraph cue-marks which the editor has drawn beforehand on the picture-film loop are provided. The actor makes repeated attempts at matching the exact timing and performance of the guide track until the director decides that the result is satisfactory. It is possible at any time to switch the machine with the virgin loop from record to playback in order to check the result on a studio loudspeaker.

Once successfully recorded, the loops are removed from the machines and are replaced with the next set of loops covering the next section of dialogue. The entire operation is then repeated for this new section. An average feature film may require several hundred dialogue loops, each one of which may have to be recorded several times with fresh virgin loops, depending on the number of actors in the scene.

The task facing the actor is difficult, since a difference of one to two film frames from synchronism between words and mouth movements is noticeable to the average viewer, yet amounts to a difference of only 0.05 to 0.1 seconds. Inevitably, artistic expression becomes subordinated to the need to speak in synchronism. Frequently, after many attempts, a compromise is settled for which is nearly right and which the soundtrack editor knows from experience will enable him to take the magnetic film back to the editing room and, with fine cutting, pull the words into synchronism.

The newly recorded loops are eventually assembled into the places in the dialogue track previously occupied by the original dialogue.

The virgin loop system is laborious and time-consuming, and is greatly disliked by actors. Furthermore, it is a generally held view in the film industry that post-synched dialogue is always inferior to original live dialogue from an acting point of view.

With the development of film transport machines capable of high-speed operation in forward and reverse and having logic control, a method known as Automatic Dialogue Replacement (ADR) has come into use in the newer studios.

One example of such a studio is described by Lionel Strutt in an article entitled "Post-Synchronising Sound: Automated Dialogue Replacement using the Computer" at pages 196 to 198 in The BKSTS Journal of March 1981, published in England. In ADR it is not necessary to break the film physically into loops. Rolls of picture film, corresponding guide track and virgin magnetic film are loaded onto the respective picture film projector, magnetic film reproducer and magnetic film recorder/reproducer in their entirety, and each loop is formed electronically, in that the machines play through the respective designated dialogue section at normal speed, then fast return back to the beginning of the section and repeat, all locked in synchronism. For example, in the Magnatech 600 Series EL system, interlock pulses are sent by the 8LB Interlock Generator to each slave machine, i.e. to the picture film projector, the guide track reproducer and the virgin magnetic film recorder/reproducer. These pulses are generated at a rate of ten pulses per film frame and are provided in the form of two square waves which are 90° out of phase with one another, the second lagging the first for forward motion and the first lagging the second for reverse motion. Four modes of movement are possible under the command of the MTE 152 Processor: normal speed forward and reverse, and fast forward and reverse. At the normal running speed, the pulse frequency of the interlock pulses transmitted by the interlock generator to the three machines is quartz oscillator controlled. These interlock pulses are also routed to the MTE 9E counter. In a post-synching operation, the rolls of film are laced into the machines at their heads, and a sync mark which the editor has marked beforehand on all the rolls is used to ensure that the three films are adjusted to be in stationary sync. This sync mark is usually designated as 0 feet 0 frames and any point on the rolls can be identified by the number of feet and film frames from the sync mark.

Each length of picture film and corresponding guide track which is to be treated as a loop, and which is referred to as a designated loop section, can be specified by two sets of film footage and frame numbers entered into a preset unit (the MTE 151E Pre-set), one set defining the beginning, the other the end of the designated loop section. When the rolls of film are laced at the sync mark the MTE 9E counter is reset to zero (0000,00). The MTE 9E counter is then able to produce a 6-digit binary-coded-decimal signal of footage and frames corresponding to the instantaneous position of the film transport machines relative to the film rolls by counting the interlock pulses from the 8LB interlock generator. This BCD signal is supplied to the MTE 151E Pre-set where it is compared with the two sets of BCD footage and frame numbers entered by the operator as start and finish frame identification for the designated loop section. The result of this comparison is supplied to the MTE 152 Processor as either an AHEAD OF LOOP signal, an IN LOOP signal, or a PAST LOOP signal. In use, the MTE 152 Processor cycles the machines through a selected designated loop section by starting from a point 5 to 10 feet in front of the loop entry frame, i.e. the first frame in the designated loop section, then running at normal speed through to the end of the designated loop section, and then rewinding at fast reverse speed and repeating the cycle. At transition from ahead of loop to in loop, the 151E preset for loop entry frame matches the MTE 9E counter BCD signal and the MTE 152 Processor produces a MASTER RECORD ON signal which activates the recording function of the recorder/reproducer. Similarly, this signal is switched off at transition from in loop to past loop. The analog audio signals from the magnetic film reproducer and the actor's microphone are routed, via a mixing console for example, to the actor's headphones and the magnetic film recorder/reproducer respectively.
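By way of illustration only, the pulse counting and loop comparison described above can be sketched as follows. The figure of ten pulses per frame comes from the text. The figure of 16 frames per foot (35 mm film), the choice to count on the rising edge of the first square wave, the treatment of the finish frame as the first frame past the loop, and all names are assumptions made for this sketch and do not describe the Magnatech hardware.

```python
PULSES_PER_FRAME = 10   # stated in the text
FRAMES_PER_FOOT = 16    # assumption: 35 mm film, not stated in the text


class InterlockCounter:
    """Hypothetical model of a footage/frames counter fed by two
    square waves that are 90 degrees out of phase."""

    def __init__(self):
        self.pulses = 0
        self._a_prev = 0

    def reset(self):
        """Zero the count, as when the rolls are laced at the sync mark."""
        self.pulses = 0

    def sample(self, a, b):
        """Feed one sample of the first (a) and second (b) square wave."""
        if a == 1 and self._a_prev == 0:
            # Rising edge of the first wave. If the second wave lags the
            # first (forward motion) it is still low here; if it leads
            # (reverse motion) it is already high.
            self.pulses += 1 if b == 0 else -1
        self._a_prev = a

    def position(self):
        """Return (feet, frames) relative to the sync mark."""
        frames_total = self.pulses // PULSES_PER_FRAME
        return divmod(frames_total, FRAMES_PER_FOOT)


def loop_status(position, loop_start, loop_finish):
    """Compare a (feet, frames) position with the preset loop limits."""
    if position < loop_start:
        return "AHEAD OF LOOP"
    if position < loop_finish:   # assumption: finish frame is exclusive
        return "IN LOOP"
    return "PAST LOOP"


FORWARD_CYCLE = [(0, 0), (1, 0), (1, 1), (0, 1)]   # second wave lags first
REVERSE_CYCLE = [(0, 0), (0, 1), (1, 1), (1, 0)]   # first wave lags second

counter = InterlockCounter()
for _ in range(350):                 # 350 pulses forward = 35 frames
    for a, b in FORWARD_CYCLE:
        counter.sample(a, b)
forward_position = counter.position()
for _ in range(10):                  # 10 pulses in reverse = 1 frame back
    for a, b in REVERSE_CYCLE:
        counter.sample(a, b)
reverse_position = counter.position()
```

Positions are held as (feet, frames) tuples so that ordinary tuple comparison orders them along the film.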

In relation to the virgin loop system, ADR has the advantages that the duration of each designated loop section can be specified and altered during a post-synching session to suit an actor, and that more than the most recently produced recorded loop can be replayed for assessment by actor and director.

However, the sound editor still has to edit the post-synch dialogue to "pull" it into acceptable synchronism. Furthermore, the several actors in a scene cannot record onto separate multi-tracks on the virgin stock, since cutting one would interfere with the others alongside it. Thus a separate roll of virgin magnetic film is required for every actor in a scene.

Similarly, where videotape is used instead of film, post-synching of dialogue must sometimes be carried out, and, hitherto, the methods used have been analogous to those for film ADR.

The aspect of conventional post-synching which is the principal cause of difficulty and constraint is the necessity for the actor to begin speaking at a predetermined instant to within a fraction of a second, and to maintain synchronism to the end of a spoken passage. There is a need for a method and equipment which makes post-synching less onerous. The present invention arises out of attempting to provide such a method and equipment but is not limited to the processing of speech signals for the purposes of post-synching. The present invention may be applied in other circumstances in which a second signal substantially resembling a first signal is edited as regards the relative timing of particular features of the second signal so as to align these particular features with the corresponding features in the first signal, whereby an output is produced which substantially replicates the first signal at least as regards the timing of the particular features chosen. The present invention may be regarded as providing a method and signal processing apparatus for finding chosen features in two similar signals and automatically editing one of these signals so as to substantially eliminate any relative timing discrepancies between corresponding chosen features of the two signals without the editing affecting essential signal characteristics.

According to one aspect of the present invention, there is provided a method of processing signals, the method having the steps of producing data related to selected time-dependent features of a first signal and data related to the same time-dependent features of a second signal which substantially resembles the first signal; utilizing the said data so as to produce data representative of difference between the timing of features of the second signal and the timing of corresponding features of the first signal; producing data representative of the waveform of the second signal in a medium suitable for signal editing; utilizing the timing difference data to generate editing data suitable for editing the data representative of the second signal so as to produce output data representative of an edited form of the second signal which substantially replicates the relative timing of the said features of the first signal; and editing the data representative of the second signal in accordance with the editing data.

According to another aspect of the present invention, there is provided signal processing apparatus comprising: means for determining from first and second signals data related to selected time-dependent features of the said signals; means for utilizing the said data so as to produce data representative of difference between the timing of the said features of the second signal and the timing of substantially the same features in the first signal; means for producing and storing data representative of the second signal waveform; means for utilizing the timing difference data so as to generate editing data suitable for editing the data representative of the second signal to produce output data representative of an edited form of the second signal which substantially replicates the relative timing of the said features of the first speech signal; and means for effecting such editing.

According to a further aspect of the present invention there is provided a method for use in producing recorded speech, the method having the following steps: producing digital data representative of a second speech signal which is substantially imitative of a first speech signal; processing the said first and second speech signals at regular intervals to determine therefrom the occurrence and/or value of selected speech parameters of the first and second signals; generating digital data indicating the presence or absence of speech in the second signal in response to processed digital data representative of the occurrence and/or value of selected speech parameters in the second signal; generating digital data representative of pitch in the second signal; utilizing the sequences of digital data indicating the presence or absence of speech and representative of speech parameters of the first and second speech signals to generate digital data representative of difference between the timing of the said characteristic features of the second speech signal and the timing of the corresponding characteristic features of the first speech signal; processing the digital data representative of pitch and the said difference in timing and the sequence of digital data indicating the presence or absence of speech in the second speech signal so as to generate editing data in accordance with a requirement to substantially replicate with the said characteristic features of the second speech signal the timing of the corresponding characteristic features of the first speech signal by adjusting the durations of silence and/or speech in the second speech signal; and editing the digital data corresponding to the second speech signal in accordance with the editing data and generating thereby edited digital data corresponding to an edited version of the second speech signal.

According to another aspect of the invention there is provided a digital audio system including means for storing digital data corresponding to a second speech signal which is substantially imitative of a first speech signal; means for reading the said digital data from the said storing means; means for determining from the first and second signals at regular intervals the occurrence and/or value of selected speech parameters of the first and second signals; means for generating digital data encoding characteristic acoustic classifications such as silence, unvocalised sound and vocalised sound in response to processed digital data representative of the occurrence and/or value of selected speech parameters; means for generating digital data representative of pitch in the second signal; means for utilizing the sequences of digital data encoding the said characteristic classifications and representative of speech parameters of the first and second speech signals to generate digital data representative of difference between the timing of the characteristic features of the second speech signal and the timing of the corresponding characteristic features of the first speech signal; means for processing the digital data representative of pitch and the said difference in timing and the sequence of digital data encoding characteristic classifications so as to generate editing data in accordance with a requirement to substantially replicate with the features of the second speech signal the timing of the corresponding characteristic features of the first speech signal by adjusting the durations of silence and/or speech in the second speech signal; and means for editing the digital data corresponding to the second speech signal in accordance with the editing data and generating thereby edited digital data corresponding to an edited version of the second speech signal.

According to yet another aspect of the invention there is provided recorded speech produced by a method or with an apparatus or system as defined in any of the preceding four paragraphs. The recorded speech may be in the form of a dialogue track for a film or videotape.

In general, it frequently occurs that a signal of interest, which can be represented as a function of time t by s₁(t), can only be recorded under less than ideal conditions. Typically, in being recorded, such signals pass through a linear, time-invariant system of impulse response h(t), and are corrupted by additive noise which is also a function of time, q(t). Only the resulting signal x₁(t) can be captured at a receiver. In other instances there is no degradation, so that x₁(t) = s₁(t), but the signal may still not be satisfactory for other reasons. Nevertheless, time-dependent features of s₁(t) which are significant for some purpose have occurred at specific moments in time and it is the relative timing of the occurrence of these features that often must be preserved. Such an unsatisfactory signal x₁(t) with significant time-dependent features will now be referred to as a reference signal. In applying the present invention to these circumstances, a first step is the provision of a second signal x₂(t'), which will now be referred to as the replacement signal, where t' indicates that x₂(t') is a function of time on a scale independent of t. The replacement signal contains essentially the same sequence of time-dependent features as s₁(t), but its features occur with only roughly the same timing as the corresponding features of s₁(t).
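The recording model described in words above, a linear time-invariant system of impulse response h(t) followed by additive noise q(t), can be written out as a convolution. The integral limits and the use of the Dirac delta δ(t) for the undegraded case are standard conventions assumed here rather than taken from the text.

```latex
x_1(t) = \int_{-\infty}^{\infty} h(\tau)\, s_1(t - \tau)\, d\tau + q(t)

% Undegraded case: h(t) = \delta(t) and q(t) = 0, which gives
x_1(t) = s_1(t)
```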

Normally it is not necessary that t and t' begin from the same absolute moment in time because either or both of x₁(t) and x₂(t') may be stored for later access and retrieval. It should be noted that t and t' can refer to the time scale of either the actual or stored reference or replacement signals, respectively. The times t=0 and t'=0 refer to the beginnings of signals x₁(t) and x₂(t'), respectively, whether these are the actual signals or their stored versions. Furthermore, the first significant event to occur in x₁(t) is the beginning of the signal s₁(t) at some value t>0 and, similarly, a corresponding signal of interest s₂(t') in x₂(t') begins at some value of t'>0. Selected physical aspects of the signals x₁(t) and x₂(t') are periodically measured and from these measurements values of useful signal parameters, including time-dependent parameters, are determined. The measurements are carried out at a sufficiently high rate for significant changes in the characteristics of the signals x₁(t) and x₂(t') to be detected. The replacement signal is also classified from the sequence of some or all of the parameters, the classification referring to whether the signal of interest s₂(t') is present or not in x₂(t') over the measurement period. The time-dependent parameters of each measured signal and the time-dependent classifications of the replacement signal are then processed using pattern matching techniques to produce a time-dependent function, which may be referred to as a time-warping path, that describes the distortion of the time scale of the replacement signal x₂(t') that must take place to give the best replication of the timing of the time-dependent features of the reference signal.

The time-scale distortion function is analysed to detect the presence of sufficient discrepancies between the reference and replacement signals' time scales to warrant alterations being made to the signal waveform of the replacement signal to achieve the desired alignment of significant features occurring on the time scale of the replacement signal with the corresponding significant features on the time scale of the reference signal. The information obtained from this analysis of the time-scale distortion is utilized with information on the time-dependent classifications of, and possibly pitch and waveform data of, the replacement signal to generate detailed control information for an editing process which is to operate on the replacement signal. This control information is then used in the editing process, in which the control information actuates the deletion and/or insertion of appropriate sequences of signal data from or into the replacement signal so as to substantially replicate the timing of the significant relative time-dependent features of the reference signal in the edited signal.
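The passage above says only that "pattern matching techniques" yield the time-warping path. As a minimal sketch, classical dynamic time warping is one such technique and is assumed here; it is not stated to be the method used by the processor. The squared Euclidean frame distance, the three permitted step directions and the function names are likewise assumptions. The result lists, for each reference frame k, a replacement frame j aligned with it.

```python
def frame_distance(a, b):
    """Squared Euclidean distance between two parameter vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def time_warp_path(ref, rep):
    """Return w where w[k] is the replacement frame j aligned with
    reference frame k, found by plain dynamic time warping."""
    K, J = len(ref), len(rep)
    inf = float("inf")
    cost = [[inf] * J for _ in range(K)]
    back = [[None] * J for _ in range(K)]
    for k in range(K):
        for j in range(J):
            d = frame_distance(ref[k], rep[j])
            if k == 0 and j == 0:
                cost[k][j] = d
                continue
            candidates = []
            if k > 0 and j > 0:
                candidates.append((cost[k - 1][j - 1], (k - 1, j - 1)))
            if k > 0:
                candidates.append((cost[k - 1][j], (k - 1, j)))
            if j > 0:
                candidates.append((cost[k][j - 1], (k, j - 1)))
            best, prev = min(candidates)
            cost[k][j] = d + best
            back[k][j] = prev
    # Backtrack from the end of both signals, keeping for each k the
    # largest j that the path pairs with it.
    w = [0] * K
    seen = set()
    k, j = K - 1, J - 1
    while True:
        if k not in seen:
            w[k] = j
            seen.add(k)
        if back[k][j] is None:
            break
        k, j = back[k][j]
    return w


# One-dimensional "parameter vectors": the replacement has two extra
# frames of leading silence compared with the reference.
reference = [[0.0], [5.0], [5.0], [0.0]]
replacement = [[0.0], [0.0], [0.0], [5.0], [5.0], [0.0]]
w = time_warp_path(reference, replacement)
```

Where w advances by more than one replacement frame per reference frame, the replacement runs slow and material would have to be removed; where w stalls, material would have to be inserted.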

In accordance with a preferred embodiment of the present invention a computer system with a large disc storage is arranged to record and automatically post-synchronise new dialogue with an original guide track. The system adjusts the timing of the new words primarily by altering the duration of the silent gaps between words and, in acceptable situations, by adjusting the duration of the speech elements. The decisions controlling this "microediting" of the speech are based on a knowledge of the production and perception of speech and will therefore ensure that the edited speech sounds natural. The processing does not necessarily take place in real time. It takes place during recording of the new dialogue and, if necessary, during wind-back and playback phases of the operation, and thus causes no delays. This preferred computing system has: an analog-to-digital and digital-to-analog conversion system coupled via a large buffer memory and input/output interface to a high speed (i.e. 1.2 Mbyte/sec) data transfer bus; a dual channel parameter extraction processor system coupled via an I/O interface to the bus; a large capacity (i.e. 84 Mbyte) magnetic disc memory coupled via a disc controller to the bus; suitable hardware for receiving film frame position and control signals produced by a Magnatech EL system and transmitting control signals to the Magnatech EL system, coupled to a parallel input/output port of a single board computer with on-board random access memory which is in turn coupled to the bus; a logic control and data entry keyboard and VDU coupled to a serial input/output port of the single board computer; and a second single board computer coupled to the bus and via a serial or parallel port to the other single board computer.

This invention will now be described by way of example with reference to the accompanying drawings, in which:

FIG. 1 is a block diagram of a post-synchronising system embodying the invention,

FIG. 2 is a more detailed block diagram of a processor in the system of FIG. 1, the processor embodying the invention,

FIG. 3 is a block diagram of part of the processor of FIG. 2,

FIG. 4 is a block diagram representing schematically processes carried out by part of the processor of FIG. 2,

FIG. 5 is a schematic diagram of an interface in the processor of FIG. 2,

FIG. 6 is a block diagrammatic representation of the processing effected by the processor of FIG. 2,

FIGS. 7, 8 and 9, the last consisting of (a)-(c), are graphical illustrations for explaining some processes effected in the processor of FIG. 2,

FIG. 10 is a flow diagram of part of the processing effected in the processor of FIG. 2,

FIGS. 11 and 12 are graphical illustrations of data organization and processing effected in the processor of FIG. 2,

FIG. 13 is a group of three graphical illustrations, (a)-(c), for explaining processes in the processor of FIG. 2,

FIGS. 14, 15 and 16 are flow charts illustrating three stages of processing effected in the processor of FIG. 2,

FIG. 17 is a graphical illustration of a selection procedure included in the processing illustrated by FIG. 16,

FIG. 18 is a graphical illustration of a computed time warping path and its relationship to an input analog signal and a resulting output analog signal,

FIG. 19 is a set of five graphical illustrations (a)-(e) for explaining the processing by the processor of FIG. 2 in relation to analog signals,

FIGS. 20(a), 20(b) and 20(c) form a flow chart illustrating processing in the editing effected in the processor of FIG. 2, and

FIG. 21 is a detailed block circuit diagram of part of the processor of FIG. 2.

FIG. 1 illustrates schematically an embodiment 10 of the invention cooperating with automated dialogue replacement studio equipment to provide edited replacement dialogue which is in synchronism with picture film. The automated dialogue replacement equipment consists of an actor's microphone 11, an audio console 12 and Magna-Tech Electronic units MTE 600 recorder/reproducer 13, MTE 600 guide track reproducer 14, MTE 152 processor 15, MTE 8LB interlock generator 16, MTE 9E counter 17, and MTE 151E pre-set unit 18, with interconnecting signal channels. A Magna-Tech PR 635 High Speed Projector (not shown) is also included for projecting picture film.

In use, as in the automatic dialogue replacement method (ADR), respective rolls of picture film, corresponding guide track and virgin magnetic film are loaded respectively onto the film projector (not shown), the magnetic film reproducer 14 and the magnetic film recorder/reproducer 13. Signals from the actor's microphone 11 are routed through the audio console 12 to the embodiment 10, referred to in FIG. 1 as a post-sync dialogue signal processor, which also receives guide track audio signals from the guide track reproducer 14. An analog audio output which is a version of the signal from the microphone 11 edited into synchronism with the guide track audio signal from the guide track reproducer 14 by the embodiment 10 is supplied by the embodiment 10 to the recorder/reproducer 13 through the audio console 12. As in conventional automatic dialogue replacement, a post-synching session is started from the MTE 152 processor 15 which cycles the projector (not shown) and the guide track reproducer 14 through a selected designated loop section, starting 5 to 10 feet in front of the loop entry frame and then running at normal film speed through to the end of the designated loop section, the projector (not shown), the guide track reproducer 14, and the MTE 9E counter being supplied with interlock pulses from the interlock generator 16 under the control of the MTE 152 processor 15. The interlock pulses are also supplied to the MTE 600 recorder/reproducer 13, but recording by this recorder/reproducer 13 is controlled by the post-sync dialogue signal processor 10. The film footage and frame numbers are tracked conventionally by the counter 17 and AHEAD OF LOOP, IN LOOP, and PAST LOOP signals are provided by the pre-set unit 18 and supplied to the MTE 152 processor 15 in the known manner.

Motion commands supplied to the interlock generator 16 by the MTE 152 processor 15 are the known fast forward and reverse, normal film speed forward and reverse, stop and the other standard commands provided by the MTE 152 processor for the MTE 8LB interlock generator. The MTE 152 processor MASTER RECORD and record/playback status signals which are under operator control are supplied to the post-sync dialogue signal processor 10 which utilizes these signals in its processing. The MTE 600 recorder/reproducer 13 also produces a SYNC SPEED FORWARD signal when it is running at normal speed forward and this signal is supplied to the dialogue signal processor 10 for utilization. The BCD film footage and frames number signal generated by the counter 17 is supplied to the dialogue signal processor 10 to provide data utilized in the processing.

FIG. 2 shows schematically the post-sync dialogue processor 10 which embodies the invention. As shown in FIG. 2, the signals supplied to the processor 10 by the Magna-Tech Electronic units 13, 15 and 17 are inputs to a circuit referred to herein as a Magnatech interface 19. The interface 19 is shown in FIG. 5 to include a multiplexer 20 for converting the 6-digit BCD footage and frames signal from the counter 17 into a single digit parallel input to a first single-board computer SBC1, shown in FIG. 2. The computer SBC1 has a 128 kilobyte memory, controls the multiplexer 20, receives through respective buffers 21 of the interface 19 the system status record and playback signals and the master record and sync speed forward signals, and outputs through a further buffer 22 of the interface 19 a master record signal to the recorder/reproducer 13. The MTE 152 processor 15 is enabled by this arrangement to serve as a master console.
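As an illustration of reading the 6-digit BCD signal one digit at a time, the sketch below assembles six 4-bit digits into a footage and frames pair. The split into four footage digits and two frame digits follows the "(0000,00)" reset value quoted earlier. The digit order (most significant first), the `read_digit` callable standing in for the multiplexer 20, and the function names are hypothetical.

```python
def decode_bcd_position(digits):
    """Convert six BCD digits (FFFF ff, most significant first,
    an assumed ordering) into (feet, frames)."""
    if len(digits) != 6 or any(not 0 <= d <= 9 for d in digits):
        raise ValueError("expected six decimal digits")
    feet = digits[0] * 1000 + digits[1] * 100 + digits[2] * 10 + digits[3]
    frames = digits[4] * 10 + digits[5]
    return feet, frames


def read_position(read_digit):
    """Select each digit in turn through a hypothetical multiplexer
    accessor and decode the result. Only the low four bits are kept."""
    digits = [read_digit(select) & 0x0F for select in range(6)]
    return decode_bcd_position(digits)


# Stand-in for the hardware: a fixed counter reading of 0042 feet, 15 frames.
latched = [0, 0, 4, 2, 1, 5]
position = read_position(lambda select: latched[select])
```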

During a cycle of a designated loop section, with RECORD mode selected at the MTE 152 processor 15, the next signal of interest is MASTER RECORD active. This signal is generated by the MTE 152 processor 15 if the conditions RECORD MODE SELECTED, SYNC SPEED FORWARD COMMANDED, and IN LOOP active are all present and corresponds to detection by the pre-set unit 18 of the exact footage/frames of the start of the designated loop section. At this point the following instructions are carried out:

1. Read BCD start footage/frames and store in memory in the first computer SBC1.

2. Send message to the time warp processor computer SBC2 to start, and store time warping path and classification in memory in the computer SBC2 for access by the first computer SBC1 to generate editing data which is then stored in the memory in the first computer SBC1.

3. Reset analog-to-digital unit 28.

4. Enable interrupt from analog-to-digital unit 28 when MASTER RECORD is off, i.e. not active.

5. Wait for data from SBC2 to commence editing.

When MASTER RECORD is turned off by the MTE 152 processor, corresponding to the finish frame of the designated loop section, the following instructions are carried out:

1. Read BCD finish footage/frames and store in the memory in the first computer SBC1.

2. Carry on digitising dub for 2 seconds.

3. Empty last data buffer in analog-to-digital unit 28, disable interrupt from analog-to-digital unit 28.

4. Compute number of last processing interval and send to SBC2.

5. Complete editing operations.
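The two instruction lists above are triggered by the MASTER RECORD signal turning on and off. A minimal sketch of that edge detection follows, assuming MASTER RECORD is the logical AND of the three conditions named in the text. The class, its method and the action strings (which paraphrase the listed steps) are hypothetical and are not taken from the SBC1 software.

```python
class LoopRecordController:
    """Tracks MASTER RECORD and reports which listed steps fall due
    when it turns on or off."""

    ON_ACTIONS = (
        "store start footage/frames in SBC1",
        "tell SBC2 to start time warping",
        "reset analog-to-digital unit",
        "enable interrupt from analog-to-digital unit",
        "wait for SBC2 data, then commence editing",
    )
    OFF_ACTIONS = (
        "store finish footage/frames in SBC1",
        "carry on digitising for 2 seconds",
        "empty last data buffer and disable interrupt",
        "send number of last processing interval to SBC2",
        "complete editing operations",
    )

    def __init__(self):
        self.master_record = False
        self.start = None
        self.finish = None

    def update(self, record_mode, sync_speed_forward, in_loop, position):
        """Feed the current status signals and (feet, frames) position.
        Returns the actions due on this call, or an empty tuple."""
        active = record_mode and sync_speed_forward and in_loop
        actions = ()
        if active and not self.master_record:
            self.start = position          # loop start frame
            actions = self.ON_ACTIONS
        elif self.master_record and not active:
            self.finish = position         # loop finish frame
            actions = self.OFF_ACTIONS
        self.master_record = active
        return actions


controller = LoopRecordController()
before = controller.update(True, True, False, (10, 0))    # ahead of loop
at_start = controller.update(True, True, True, (12, 4))   # loop entry frame
inside = controller.update(True, True, True, (12, 5))     # still in loop
at_finish = controller.update(True, True, False, (40, 0)) # past loop
```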

Having cycled once in the RECORD mode, the MTE 152 processor 15 jumps into PLAYBACK mode automatically at the loop finish point, and will then go into rewind to a point before the loop start and then enter normal speed forward. The next signal of interest is the SYNC SPEED FORWARD generated by the recorder/reproducer 13. Monitoring of this signal by the dialogue signal processor 10 prevents a digital to analog output of the edited dub when the BCD footage/frames position matches the stored loop start point as the MTE 152 processor 15 effects fast wind back through the loop.

When the SYNC SPEED FORWARD signal is received (the MTE 152 processor 15 mode already being PLAYBACK), the following are carried out:

1. Pre-load data buffer of digital-to-analog unit 29 with mute on (see description of FIG. 21 hereinafter).

2. Match BCD footage/frames with loop start frame in memory (use least significant bit of counter to strobe the footage counter bits).

When the loop start frame is reached:

1. Supply MASTER RECORD signal to recorder/reproducer 13 from the processor 10.

2. Reset buffer address pointer to zero, and turn mute off (output begins).

At loop finish point:

1. Switch off MASTER RECORD signal from processor 10.

No part of the dub will be lost on magnetic film since, although in the RECORD mode the actor may have been speaking after the loop finish point, this speech will have been warped back to within the loop section by the dialogue signal processor 10.

The first single-board computer SBC1 is coupled to a similar second single-board computer SBC2 for i/o port handshakes for interboard communication by a bus 23, and both computers SBC1 and SBC2 are connected to a multibus 24 for two-way traffic of data, address and control signals. To provide adequate storage for the dialogue processing to be effected, an 84 megabyte Winchester disc store 25 is coupled to the multibus 24 by a disc controller 26. The first computer SBC1 serves as system controller and as a signal editor in editing processes to be described hereinafter. The second computer SBC2, which also has 128 kilobytes of memory, serves to carry out time warping processes. The computers SBC1 and SBC2 may each be an SBC 86/30 by Intel Corporation. The multibus 24 can then be a multibus card frame SBC 608 by Intel Corporation, and the disc controller 26 an SBC 220 by Intel Corporation. The disc storage 25 may be an M23128K by Fujitsu.

A visual display unit (VDU) and data entry terminal 27 is coupled to the first computer SBC1 to allow processing parameters chosen by the user to be entered into SBC1.

Audio signals from the actor's microphone 11, routed by the audio console 12 to the post-sync dialogue signal processor 10, enter as analog input to an analog-to-digital converter unit 28, shown in more detail with a digital-to-analog converter unit 29 and a shared buffer 30, bus interface 31 and control unit 32 in FIG. 3. The bus interface 31 couples the buffer 30 and control unit 32 to a data and control bus 33 connected to the multibus 24. When the bus interface 31 is enabled by a respective signal from the multibus 24, control signals are passed through the bus interface 31 to the control unit 32, which controls a sample and hold circuit 34 and an analog-to-digital converter 35. Microphone signals pass through a buffer amplifier 36 to a low pass filter 37 before reaching the sample and hold circuit 34. The signal samples produced in the sample and hold circuit 34 are digitized by the converter 35 and the digital output is supplied to the buffer 30, which is large, for accessing by the first computer SBC1. The control unit 32, bus interface 31 and buffer 30 also take part in the outputting of edited dialogue data, this data being transferred from the data and control bus 33 by the bus interface 31 to the buffer 30 and thence to a digital-to-analog converter 38. The analog output from the converter 38 is supplied to a de-glitch amplifier 39, which is a known circuit for removing non-speech transient components resulting from digital-to-analog conversion, and the output from the de-glitch amplifier 39 is passed through another low pass filter 40 to an audio output amplifier 41. The analog audio output from the output amplifier 41 is the output supplied by the dialogue signal processor 10 to the MTE 600 recorder/reproducer 13.

The audio input signal from the actor's microphone is also supplied to the first of two identical speech parameter extraction processors 42 and 43, the processor 42 being inscribed DUB PARAMETER EXTRACTION PROCESSOR. The other parameter extraction processor 43, inscribed GUIDE TRACK PARAMETER EXTRACTION PROCESSOR, receives the audio output signal from the MTE 600 guide track reproducer 14. The guide track parameter extraction processor 43 will be described in more detail hereinafter with reference to FIG. 4. The two parameter extraction processors 42 and 43 are coupled to the multibus 24 by a bus interface 44.

In a post-synching session, the Magna-Tech 152 Processor 15 cycles through a designated loop section, during which the actor attempts to speak his lines of dialogue in imitation of the signal on the guide track, the corresponding length of picture film being synchronously projected for the actor to see. At the loop entry point in this first cycle, the actor, having received a visual or aural cue, begins speaking. The actor's microphone 11 is connected to the analog-to-digital converter unit 28 so that as he speaks, the speech signal produced by the microphone 11 is digitised by the converter 35 and stored in the magnetic disc store 25. This digitising begins at the precise moment of loop entry and continues, the footage/frame of the entry point having been entered into memory of the first computer SBC1. The actor's microphone is also connected to the dub parameter extraction processor 42, the guide track parameter extraction processor 43 is connected to receive the guide track audio signal from the guide track reproducer 14, and at the same time in the two computers, SBC1 and SBC2, analysis and processing of the actor's and guide track speech signals and generation of editing data can begin, and the editing data so produced be entered into the memory of the first computer SBC1. At the loop finish point, the BCD footage/frame is entered into memory and the digitising, storage and analysis of the actor's speech continues for about two seconds after the loop finish point in case he is still speaking. The processing of the actor's and guide track speech data continues during the fast rewind phase of this first cycle of the designated loop section and is completed, possibly during the rewind.

This first cycle is repeated if the actor's performance is not satisfactory.

The next step is a second or further cycle through the designated loop section during which the actor's speech data stored in the disc store 25 is read out, edited by the first computer SBC1 in accordance with the stored editing data and converted by the digital-to-analog converter unit 29 into an analog signal and thence by a studio loudspeaker unit (not shown), including any necessary amplifier stages, into an audible speech signal. The adequacy of the new speech signal generated, in the form of the digital data stored in the disc store 25 and edited by the first computer SBC1, as dialogue for the film is assessed by the director and actor during this second cycle. At the same time the analog signal is supplied to the magnetic film recorder/reproducer 13 which records the new dialogue onto the virgin magnetic film, the system activating and de-activating the record function of the recorder/reproducer 13 at the loop entry and exit points respectively, provided the sync speed forward signal is active. If the new dialogue is satisfactory, a start is made on the next designated loop section. If, however, the edited data does not give a satisfactory effect with the picture film, the process is repeated.

In FIG. 6, which is a block diagram representing the digital data processing carried out by the dialogue processor 10, data processing steps are indicated by legends within blocks, so that the blocks with such legends may be representative of processes carried out by a computing system, or of hardware units for carrying out such processes, or in some cases such hardware units and in other cases processes carried out by a computing system cooperating with the hardware units.

In FIG. 6, the guide track analog signal is mathematically represented as a function x₁ (t) of an independent variable t which is a measure of time, and the analog signal from the actor's microphone 11 is mathematically represented as another function x₂ (t') of another independent variable t' which also is a measure of time in the same units as the variable t but of independent origin.

The generation of the speech parameters from the recorded guide track and the dub involves processing by, and the periodic output of parameters from, the two extraction processors 42 and 43. These parameters are stored at least temporarily until they are processed as data sequences in the processing apparatus. One set of data sequences is generated for the designated guide track loop and another is generated for the spoken attempt (the dub) by the actor. Evaluation of minor timing variations between these data sequences takes place using a pattern matching algorithm based upon dynamic programming techniques used in speech recognition systems. Once time-warping data is generated, digital editing of the computer-stored speech waveform data can commence. Editing decisions are based on algorithms designed to allow minimum perceivable disturbance to the audible speech sound quality whilst apparently achieving perfect synchronism in relation to mouth movements visible from the projected film pictures.

In post-synching, during the cycle in which the actor speaks, generation and processing of speech parameters from both the guide track signal x₁ (t) and the microphone signal x₂ (t') takes place. The generation of the speech parameters for the guide track signal x₁ (t) and dub signal x₂ (t') is represented in FIG. 6 by blocks 45 and 46 respectively.

This parameter data may optionally be stored on disc for later retrieval and processing, or as it is generated it may be immediately processed in a block 47 inscribed GENERATE TIME WARPING PATH to produce time alignment data, referred to herein as a time warping path, which describes how best to align significant features of the dub with corresponding features of the guide track. In addition, segments of the dub are classified as speech or silence in a process block 48 from some or all of the parameter data. When a sufficient amount of time alignment data is available, it is used in a process block 49 inscribed GENERATE EDITING DATA in conjunction with the classification data from block 48 and, if necessary, fundamental period data of voiced dub segments, from a block 50, to permit microediting, i.e. editing of the fine structure, of the digitised stored dub waveform (retrieved from the disc store 25) to take place in a process 51 where it is required in the dub waveform. Any new edited waveform segments can be stored in a second part of the disc store 25 and a `table` of editing operations can be prepared for constructing the complete edited waveform during the next step from the stored edited waveform segments. The processing just described continues for a few seconds beyond the loop exit point to ensure that if the actor is speaking too slowly, the end of the speech will not be cut off and lost.

If the parameter data has been stored on disc, all of the above processing of the parameter data and microediting may continue during the rewinding of the picture film and guide track and possibly during the playback step described next. If the parameter data is not stored, it must be processed at an average real-time rate sufficient for production of the time warping path and the classification data in blocks 47 and 48. However, if the time-warping path is stored in memory, the processes of deriving the fundamental period data (block 50), generating editing data (block 49), and editing (block 51) the replacement signal may continue during the fast rewind and playback phase of the second cycle. The main requirement is that any part of the dub data to be played back must be completely processed before it is played back.

The selection of the specific types of processing used to analyse the guide track signal x₁ (t) and the dub signal x₂ (t'), and thereby generate parameters once every T seconds, where T seconds is a suitably short interval, is somewhat arbitrary in that numerous parameters reflect the underlying time-varying nature of speech. Measurement operations may be grouped conveniently according to the computational method which is used to produce the parameters. In general, three useful categories exist.

In the first, if sampled versions of both signals x₁ (t) and x₂ (t') are made available by some means, parameters can be generated by parallel processing of blocks of (stored) samples of these signals. For each signal, the blocks of samples may or may not be overlapped, depending on the amount of independence desired between blocks of samples. Among the most commonly used sample-block-oriented parameters for speech pattern matching are the short-time zero-crossing rate, short-time energy, short-time average magnitude, short-time autocorrelation coefficients, short-time average magnitude difference function, discrete short-time spectral coefficients, linear predictive coefficients and prediction error, and cepstral coefficients. Details of the definitions and procedures for calculating each of the preceding short-time parameters are found in "Digital Processing of Speech Signals" by L. Rabiner and R. Schafer, published by Prentice-Hall of Englewood Cliffs, N.J., U.S.A. in 1978.

The second category contains measurement operations which can be performed by periodically scanning and sampling (once every T seconds) the outputs of analog filter banks analysing x₁ (t) and x₂ (t'). Several such speech analysis systems are described in "Speech Analysis Synthesis and Perception. Second Edition" by J. L. Flanagan published by Springer-Verlag of Berlin, Germany in 1972.

A third category of processing operations contains those which are sampled-data or digital signal processing implementations of continuous-time analysis systems, the outputs of which may be sampled every T seconds. A typical example (which is in fact the one used in the embodiment described herein) is a parallel digital filterbank, designed and implemented as described in references such as "Theory and Application of Digital Signal Processing" by L. R. Rabiner and B. Gold published by Prentice-Hall of Englewood Cliffs, N.J., U.S.A. in 1975. This category requires (as in the first) that sampled versions of the two signals x₁ (t) and x₂ (t') are made available.

It is also possible to use parameters in any combination from the preceding types of periodically-made measurements. However, the selection of the number of parameters used can vary and generally depends on the following consideration:

Where the signal of interest s₁ (t) in the reference signal x₁ (t) is degraded by noise and filtering effects, measurement of a large number of parameters permits more reliable comparisons to be made between the reference and replacement signals x₁ (t) and x₂ (t'). The type and degree of degradation influences the choice of parameter to be used in subsequent stages of processing. If the reference signal x₁ (t) consists purely of the signal of interest s₁ (t), only a few parameters are required for use in subsequent processing operations.

Lastly, if a variety of types of parameters are generated, and each of these parameters is described by numbers lying within a particular range, a means must be provided which normalizes each parameter so as to provide substantially equal numeric ranges for each normalized parameter. Such a normalization procedure is needed to ensure that the contribution of each parameter to the pattern matching process which generates the time alignment data will be roughly equivalent.
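The normalization requirement can be illustrated with a minimal sketch. The text does not specify a normalization method; the linear min-max rescaling onto 0 to 255 used here, the function name `normalize_parameter`, and the sample values are all assumptions chosen only for illustration.

```python
def normalize_parameter(values, out_max=255):
    """Rescale one parameter track linearly onto 0..out_max.

    Min-max scaling is an assumed scheme; any mapping that gives each
    parameter a comparable numeric range would serve the same purpose.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant track carries no information; map it to zero.
        return [0 for _ in values]
    scale = out_max / (hi - lo)
    return [round((v - lo) * scale) for v in values]


# Two hypothetical parameter tracks with very different native ranges.
energy = [0.002, 0.010, 0.050, 0.020]
zero_crossings = [12, 40, 90, 33]

norm_energy = normalize_parameter(energy)
norm_zc = normalize_parameter(zero_crossings)
```

After rescaling, both tracks span the same range, so neither dominates a distance computed across them.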

The main criteria for the selection of parameters are that successive samples of any parameter should: (a) reflect significant changes within a speech signal which relate to physical aspects of the production of the speech; (b) be generated efficiently in hardware or software at a rate significantly lower than that required to sample the dub waveform; and (c) not be easily contaminated by noise.

The rate (T⁻¹ seconds⁻¹) at which sets of parameters are generated in parallel is referred to hereinafter as the `data frame` rate (as distinguished from the film frame rate) or simply `frame` rate when no confusion can arise. Thus the data frame rate is the rate at which parameter vectors are generated. Therefore, once during each data frame period, parallel processing operations take place for both the guide track and the dub, and these processing results are then grouped into two respective data units which will be referred to as the guide (or reference) parameter vector and the dub (or replacement) parameter vector.

In FIG. 6 various forms of signals are represented by different types of lines connecting the blocks: solid lines represent full bandwidth analog or digital signal routes; broken lines represent the routes of data sampled at the frame rate; and double broken lines represent parallel data routes.

The reference signal x₁ (t), which in this example is the output of the guide track magnetic film reproducer 14, is played back, and at the same time the replacement signal x₂ (t'), which in this example is the output of the microphone 11, is passed through the low pass filter 37 (FIG. 3) to the analog-to-digital converter 35. The filter 37 has a cutoff frequency, f_(c), which is located at the highest frequency to be reproduced. The sample and hold circuit 34 samples the filtered signal at intervals of D seconds, giving a sampling rate of D⁻¹ seconds⁻¹ of more than twice the highest frequency to be reproduced. For the present example, a bandwidth of 15 kHz (=f_(c)) is sufficient and D is chosen to be 32000⁻¹ sec. The sampling and conversion process produces a stream of digital data x₂ (nD) where n=0,1,2 . . . , representative of the signal x₂ (t'). The data stream x₂ (nD) is written to disc 25 where it is held to be available for further processing. While the signal x₂ (t') is being sampled and written to disc, it is simultaneously processed by the block 46 inscribed GENERATE PARAMETERS. Similarly, the signal x₁ (t) is simultaneously processed by the block 45. One of these two identical blocks 45 and 46 is represented in further detail in FIG. 4.

In the present embodiment, a reference signal parameter vector A(kT) is formed in each guide track signal frame k, where k=1,2,3 . . . , from the sampled and logarithmically-coded outputs of the guide track parameter extraction processor 43, which contains an N-channel digital filterbank. Simultaneously, in a parallel process, a replacement signal parameter vector B(jT) is formed in each frame j, where j=1,2,3 . . . , from the sampled and logarithmically-coded output of the dub parameter extraction processor 42, which contains an N-channel digital filterbank. The two filterbanks have identical characteristics. The parameter vectors for the frame j=1 and k=1 are produced at the end of the first period of T seconds and it will be assumed that the respective signals of interest start after this first frame.

In FIG. 4 the details of the generation of A(kT) from x₁ (t) are presented. The generation of B(jT) from x₂ (t') is performed identically and is therefore not shown or discussed separately.

As shown in FIG. 4, the input signal x₁ (t) first passes through a variable gain amplifier stage 52 of gain G that is adjusted to ensure that a large proportion of the dynamic range of an analog-to-digital converter (A/D-C) 53 is used without clipping. The amplified analog signal passes through a high-frequency boosting circuit 54 (inscribed HF BOOST), providing +6 dB/octave gain from 1 kHz to 4 kHz, which compensates for the rolloff of high-frequency energy in speech signals. The resultant signal passes through a lowpass filter (LPF) 55 (e.g. a 7th-order elliptic design with passband cutoff at 4 kHz, transition width 1.25, passband ripple 0.3 dB, and minimum stopband attenuation of 60 dB) and the resulting filtered signal x₁'(t) (where here the prime indicates a filtered version of x₁) is digitized by a combination comprising a sample-and-hold device (S/H) 56 followed by the converter 53, which is in this example a 12-bit A-to-D converter (A/D-C) operating at a sampling frequency of (cD)⁻¹ Hz to produce sampled data stream x₁'(mcD) where m=0,1,2 . . . . The constant c should be an integer in order that the rate (cD)⁻¹ be integrally related to the rate D⁻¹ used to sample the replacement signal for storage, editing, and playback. By this means, synchronicity is maintained between the sampled signal x₂'(nD) and the frame indices j and k. The use of c=4 (and therefore (cD)⁻¹ =8 kHz) allows a reduction in bandwidth and sampling rate and thus provides considerable economy in the processing required to generate the parameters. At the same time, very little significant information is lost.
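The relationship between the two sampling rates and the frame interval can be checked with a few lines of exact arithmetic. This is only an illustrative sketch: D, c and the frame interval T = 0.01 sec are the values given for this embodiment, while the helper `first_storage_sample_of_frame` and its convention that frame j covers the j-th block of T seconds are assumptions.

```python
from fractions import Fraction

D = Fraction(1, 32000)   # storage sampling interval in seconds
c = 4                    # integer ratio between storage and analysis rates
T = Fraction(1, 100)     # data frame interval in seconds (T = 0.01 sec)

storage_rate = 1 / D             # samples per second written to disc
analysis_rate = 1 / (c * D)      # samples per second entering the filterbank

# Because c is an integer, every analysis sample coincides with a storage
# sample, and each frame spans a whole number of samples at both rates.
samples_per_frame_storage = T / D
samples_per_frame_analysis = T / (c * D)


def first_storage_sample_of_frame(j):
    """Index n of the first stored sample x2(nD) falling in frame j.

    Assumes frames are numbered from 1 and frame j covers the j-th
    interval of T seconds (an assumed convention for illustration).
    """
    return int((j - 1) * samples_per_frame_storage)
```

With these values each data frame corresponds to 320 stored samples and 80 analysis samples, which is what allows a frame index from the time warping path to be mapped directly onto a position in the stored waveform.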

The data stream x₁'(mcD) enters a digital filterbank 57 comprising N parallel bandpass filter sections BPFi, where i indicates a frequency band number. In the present system N=4 and the filters used are recursive implementations of 4th order Butterworth-designed bandpass filters with the following cutoff (-3 dB attenuation) frequencies:

    Band Number   Lower Cutoff   Upper Cutoff
    -----------   ------------   ------------
    1              250 Hz         500 Hz
    2              500 Hz        1000 Hz
    3             1000 Hz        2000 Hz
    4             2000 Hz        4000 Hz

The design and implementation of such filters is well known and is described, for example, in "Theory and Application of Digital Signal Processing" by L. R. Rabiner and B. Gold published by Prentice-Hall of Englewood Cliffs, N.J. in 1975.

The permitted aliasing of a small range of frequencies in x₁'(mcD) above 4 kHz into the high frequency band (i.e. band 4) is unusual but desirable in that any speech energy above 4 kHz may make a useful contribution to the pattern matching processes to follow.

The output of each bandpass section BPFi is processed identically as follows. Each BPF output is fullwave rectified in a block FWRi, and the rectified signal passes through a lowpass filter LPFi comprising two first-order leaky integrators in series, each with cutoff frequency at approximately 10 Hz. This filter smoothes the input signal and allows the resulting output to be sampled by a switch, represented schematically in FIG. 4, every T seconds where T=0.01 sec. Lastly, the sampled output data is converted in a block LOG (by means of a look-up table) into an 8-bit logarithmic quantity A_(i) (kT) where the subscript i indicates the ith band. Thus, A_(i) (kT) is one of the N components of a normalized parameter vector. Sequential access of these components, whose ranges are directly comparable, is then carried out in a block 59 inscribed FORM PARAMETER VECTOR, which is a multiplexer, to form the complete parameter vector A(kT).
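The per-band chain of rectification, smoothing, frame-rate sampling and logarithmic coding can be sketched as follows. This is a rough illustration, not the hardware described: the integrator coefficient formula, the dB mapping, and the `full_scale` and `floor_db` values are assumptions, and the logarithm is computed directly rather than through a look-up table.

```python
import math

FS = 8000     # analysis sampling rate (cD)^-1 in Hz
T = 0.01      # data frame interval in seconds
FC = 10.0     # approximate cutoff of each leaky integrator in Hz


def channel_envelope(bandpassed, fs=FS, frame_interval=T, cutoff=FC):
    """Full-wave rectify, smooth with two leaky integrators, sample every T."""
    # Assumed mapping from cutoff frequency to a one-pole coefficient.
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff / fs)
    step = round(fs * frame_interval)   # samples per data frame
    s1 = s2 = 0.0
    frames = []
    for n, x in enumerate(bandpassed, start=1):
        rectified = abs(x)              # full-wave rectification
        s1 += alpha * (rectified - s1)  # first leaky integrator
        s2 += alpha * (s1 - s2)         # second leaky integrator
        if n % step == 0:               # the sampling switch closes
            frames.append(s2)
    return frames


def log_code_8bit(value, full_scale=2048.0, floor_db=-72.0):
    """Map a linear envelope value onto 0..255 on a dB scale (assumed law)."""
    if value <= 0:
        return 0
    db = 20.0 * math.log10(value / full_scale)
    db = max(floor_db, min(0.0, db))
    return round(255 * (db - floor_db) / (-floor_db))


# One second of a 300 Hz tone standing in for the output of band 1.
tone = [1000.0 * math.sin(2.0 * math.pi * 300.0 * n / FS) for n in range(FS)]
frames = channel_envelope(tone)
coded = [log_code_8bit(v) for v in frames]
```

Once the integrators have settled, the envelope of a steady tone approaches the mean of its rectified value, so the coded output is nearly constant from frame to frame.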

The movement of the parameter vector data from the filterbank processor 43 to the next processing stage is accomplished by storing the sequential parameter vectors (comprising four bytes per frame per channel, or eight bytes per frame in total) in one of two large RAM buffer memories 60 and 61 (BUFFER MEMORY 1 and BUFFER MEMORY 2), each holding an integral multiple number R of parameter vectors. When one of these large buffers 60 and 61 becomes filled, new parameter vectors are then directed into the other buffer. Furthermore, while the second buffer fills, the processor SBC2 performing the generation of the time warping path may access the filled buffer and initiate the movement of the contents to a further storage area for eventual access during processing. After the data has been transferred from a filled buffer 60 or 61, that buffer may be overwritten with new data. Such a double-buffered system ensures no data is lost while data transfers are being made to subsequent processing sections. It should be noted that the use of a double-buffered memory for storing R parameter vectors means that after filling one buffer, if the kth parameter vector is the first one to be stored in one buffer, the (k-R)th to the (k-1)th parameter vectors will then be immediately available from the previously filled buffer. Consequently, subsequent processing of the parameter vectors will not strictly be in real-time, but the processing may operate at a real time rate on a variable-delay basis. The alternate operation of the buffers 60 and 61 is represented schematically in FIG. 4 by ganged switches 62.
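The alternating use of the two buffers can be modelled in a few lines. This is a minimal software sketch of the double-buffering idea only; the class name `DoubleBuffer` and its methods are hypothetical, and the real apparatus uses RAM buffers switched in hardware.

```python
class DoubleBuffer:
    """Two buffers of R vectors each; one fills while the other is read."""

    def __init__(self, capacity):
        self.capacity = capacity   # R parameter vectors per buffer
        self.buffers = [[], []]
        self.active = 0            # index of the buffer currently filling
        self.filled = None         # index of a full buffer awaiting transfer

    def write(self, vector):
        """Store one parameter vector, switching buffers when one is full."""
        self.buffers[self.active].append(vector)
        if len(self.buffers[self.active]) == self.capacity:
            self.filled = self.active
            self.active = 1 - self.active
            # The other buffer is assumed to have been transferred already,
            # so it may now be overwritten with new data.
            self.buffers[self.active] = []

    def read_filled(self):
        """Return the contents of the full buffer, or None if none is ready."""
        if self.filled is None:
            return None
        data = list(self.buffers[self.filled])
        self.filled = None
        return data


db = DoubleBuffer(capacity=3)
for k in (1, 2, 3, 4):
    db.write(k)
first_block = db.read_filled()   # vectors 1..3 while vector 4 sits in the other buffer
nothing_yet = db.read_filled()
for k in (5, 6):
    db.write(k)
second_block = db.read_filled()
```

The reader always receives complete blocks of R vectors, which is why downstream processing runs at a real-time rate but with a variable delay.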

TIME WARPING PROCESSOR DESCRIPTION

The operation represented by the process block 47 (FIG. 6) inscribed GENERATE TIME WARPING PATH will now be described in detail. This operation is carried out by the second single board computer SBC2. The time warping path is produced by processing the guide and dub parameter vectors to find (on a frame-by-frame basis) the sequence of dub parameter vectors that best matches the fixed sequence of guide parameter vectors by allowing dub frames to be repeated or omitted. In this embodiment, the parameter vectors represent spectral cross sections of the guide and dub speech signals. To make comparisons of the similarity between a dub and a guide spectral cross section, a simple distance metric can be used which compares not the original parameter vectors, but ones that have been processed to emphasize mainly the differences in speech patterns and to not be sensitive to environmental or recording conditions. The dub frame index sequence, i.e. the sequence of values of j, which produces the best alignment of dub parameter vectors with those of the guide defines the time warping path which will be input to the editing operations of block 49.

It should be noted that herein the term `metric` means a mathematical function that associates with each pair of elements of a set a real non-negative number constituting their distance and satisfying the conditions that the number is zero only if the two elements are identical, the number is the same regardless of the order in which the two elements are taken, and the number associated with one pair of elements plus that associated with one of the pair and a third element is equal to or greater than the number associated with the other member of the pair and the third element.

The time warping path, which is a function of k and T and is written w(kT), may be more formally specified as a non-decreasing function of the data frame indices k of the reference signal parameter vector A(kT) with the following two properties: First, for k=1,2,3 . . . , K, w(kT) is a sequence of integers in the range from 1 to J inclusive, where K and J are defined as the final frame indices of the reference signal and the replacement signal respectively. (Generally, if the parameterization of the reference and replacement signals takes place simultaneously, J=K). Secondly, w(kT) describes the best or optimal match of a sequence of replacement parameter vectors B(w(kT)) to the reference sequence A(kT). Consequently it will be assumed that w(kT), being the best match of replacement parameter vectors to reference parameter vectors, also describes as a function of time the distortion (i.e. stretching or compression) of the time scale of the replacement signal x₂ (t') that will align, in time, significant time-dependent features in the replacement signal x₂ (t') with the corresponding features in the reference signal x₁ (t).
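The structural properties required of w(kT) (integer values between 1 and J, non-decreasing in k, one value per reference frame) can be expressed as a small check. The function names and the example path below are hypothetical and serve only to illustrate how repeated and skipped values of j correspond to stretching and compression of the dub.

```python
def is_valid_warping_path(w, J):
    """Check a candidate path, where w[k-1] holds w(kT) for k = 1..K."""
    in_range = all(isinstance(j, int) and 1 <= j <= J for j in w)
    non_decreasing = all(a <= b for a, b in zip(w, w[1:]))
    return in_range and non_decreasing


def describe_steps(w):
    """Label each step: a repeated dub frame stretches the dub locally,
    a jump of more than one frame omits dub frames and compresses it."""
    labels = []
    for prev, cur in zip(w, w[1:]):
        if cur == prev:
            labels.append("repeat")
        elif cur == prev + 1:
            labels.append("advance")
        else:
            labels.append("skip")
    return labels


example_path = [1, 1, 2, 4, 5]   # hypothetical path for K = 5, J = 5
steps = describe_steps(example_path)
```

The check says nothing about optimality; it only confirms that a sequence is an admissible candidate for the matching process.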

Owing to the fact that the reference and replacement signals x₁ (t) and x₂ (t') are expected to be of fixed (but arbitrarily long) length, it is possible to represent the function w(kT) as a finite length path in the (k,j) plane. An example of a time warping path which provides the best match of a one-dimensional replacement vector to a one-dimensional reference vector is provided in FIG. 7. By a one-dimensional vector is meant a vector produced from a single parameter, i.e. N=1.

Because the k index represents the reference sequence, to which a sequence of indices on the j-axis will be assigned, path boundary conditions are rather loose in that there is some j_(o) such that at k=1, j_(o) =w(1T) and 1≦j_(o) ≦J. Similarly, there exists some j_(F) =w(KT) such that j_(o) ≦j_(F) ≦J. It will be apparent to those skilled in the art that it is unnecessary for the path to start at j=1 and end at j=J. However, there must be a total of K path values, i.e. values of w(kT).

The procedures used to discover the best of the enormous number of possible paths are, in part, derived from known word recognition techniques. In such techniques, a matching algorithm is used which, if no constraints were imposed, would be capable of allowing any replacement parameter vector B(j) to be compared with any reference parameter vector A(k) to give a measure of distance (or dissimilarity) denoted by d(k,j) between the two vectors. One useful definition of d(k,j) is a weighted "city-block" difference in the N-dimensional parameter space, i.e. d(k,j) is defined by:

$$d(k,j) = \sum_{i=1}^{N} r_i(kT)\,\bigl|B_i(jT) - A_i(kT)\bigr|$$

where r_(i) (kT) is a weighting factor for the kth frame and is discussed hereinafter. Other distance measures, e.g. the squared Euclidean distance between the vectors, can be used. It will be seen that the value of d(k,j) will vary with j when k is constant.
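A weighted city-block distance of this kind, taken here as the sum over the N bands of the weight times the absolute difference of the corresponding components, can be written directly. The function name and the sample vectors are illustrative assumptions; the weights default to 1 because the weighting factor is only discussed later in the text.

```python
def city_block_distance(a_k, b_j, r_k=None):
    """Weighted city-block distance between a reference vector a_k = A(kT)
    and a replacement vector b_j = B(jT), with per-band weights r_k."""
    if r_k is None:
        r_k = [1] * len(a_k)   # unweighted case, assumed default
    return sum(r * abs(b - a) for a, b, r in zip(a_k, b_j, r_k))


# Hypothetical 4-band log-coded vectors (N = 4).
guide_frame = [10, 20, 30, 40]
dub_frame = [12, 18, 30, 45]

unweighted = city_block_distance(guide_frame, dub_frame)
weighted = city_block_distance(guide_frame, dub_frame, r_k=[1, 0, 2, 1])
```

Setting a weight to zero removes that band from the comparison, which is one way a frame-dependent weighting could discount bands judged unreliable.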

Similarly, the sum of the values of d(k,j) when k is varied over its respective range 1 to K may be used to provide a score which varies when the values chosen for j for each particular value of k are varied. Scores accordingly furnish a useful numerical assessment of the matching of a test sequence of replacement frames to the fixed sequence of reference frames. Moreover, there is a minimum or best total score as k is varied from 1 to K and j from j_(o) =w(1T) to j_(F) =w(KT).

Given that the path starting point for determining the optimum score is fixed at k=1, the score is dependent only on the final frame index, K. Therefore the optimum score may be denoted by S(K) where

$$S(K) = \min_{\{j(k)\}} \sum_{k=1}^{K} d\bigl(k, j(k)\bigr)$$

and the notation for min indicates that the summation is to be taken over indices j (which are themselves a particular function of k) such that the resulting summation is minimized. Hence to find the best matching of the two sets of K vectors it is necessary to determine the sequence of K optimum values of j (with appropriate path constraints) which minimize the above summation S(K). The particular function of k which provides this minimum over the range k from 1 to K is the formal definition of the optimum time warping path w(kT). Other time warping functions are also described by C. Myers, L. Rabiner and A. Rosenberg in "Performance Tradeoffs in Dynamic Time Warping Algorithms for Isolated Word Recognition" in volume 28, issue No. 6 of IEEE Transactions on Acoustics, Speech and Signal Processing, at pages 623 to 635, published in 1980.
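For very short sequences the definition of the optimum score can be applied literally, by scoring every admissible path and keeping the minimum. The sketch below does this for one-dimensional vectors (N=1) with an unweighted absolute difference as the distance; the only path constraint assumed is that j is non-decreasing, and the function name and test data are hypothetical.

```python
from itertools import combinations_with_replacement


def exhaustive_best_path(A, B):
    """Score every non-decreasing sequence of K dub indices and return
    (minimum score, path). Feasible only for tiny K and J."""
    K, J = len(A), len(B)
    best_score, best_path = None, None
    # combinations_with_replacement yields sorted tuples, i.e. exactly the
    # non-decreasing index sequences of length K drawn from 1..J.
    for path in combinations_with_replacement(range(1, J + 1), K):
        score = sum(abs(A[k] - B[j - 1]) for k, j in enumerate(path))
        if best_score is None or score < best_score:
            best_score, best_path = score, path
    return best_score, best_path


A = [1, 5, 5, 2]   # hypothetical guide parameter track
B = [1, 1, 5, 2]   # hypothetical dub parameter track
S_K, w = exhaustive_best_path(A, B)
```

The number of candidate paths grows combinatorially with K and J, which is the reason an exhaustive search of this kind is impractical for continuous speech.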

In frame K, the optimal path can only be known to be optimal after A(KT) and B(JT) have been processed in the matching process. Furthermore, where continuous speech is being parameterized, K may often be on the order of several thousand. Consequently it is necessary to drastically reduce the storage and processing of the vast amount of data which would be demanded in a direct implementation of the above formulae for an exhaustive search for the optimum path. This can be accomplished through the use of a modified version of an efficient processing algorithm for generating time registration data for two substantially similar continuous speech signals that was presented by J. S. Bridle in a paper entitled "Automatic Time Alignment and its Use in Speech Research" presented at the Leeds Experimental Phonetics Symposium, 27th to 29th September, 1982 at Leeds University, West Yorkshire, England.

The original algorithm developed by Bridle will now be briefly described before the modified version is described. Bridle's algorithm, known as ZIP (owing to its action of "zipping" two similar signal sequences together), operates by producing a restricted number of potentially optimum path segments in parallel and pruning away the most unlikely candidates which have the highest (worst) dissimilarity scores. The production rules for extending the ends of the path segments are governed by principles of Dynamic Programming, constraints on the size and direction of path increments, and penalties for local time scale distortion. The optimum path is discovered in segments as the poor candidates are gradually pruned away, i.e. rejected, to leave longer and longer path segments which will eventually have origins that merge into a unique segment containing one or more path elements common to all remaining paths. If the pruning is done judiciously, the common segment is part of the optimum path w(kT), and can therefore be output as such up to the point where the path segments diverge. As the processing continues, for each reference frame processed, the path is extended one increment, since k increases by units; the necessary pruning takes place; and the origins of the remaining paths are examined for uniqueness. However, outputting of optimum path segments takes place only when the beginnings, i.e. ends not being extended, of the path segments satisfy the requirement of convergence; thus the output of path elements will generally be asynchronous with the processing of reference frames.
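The two ideas of pruning the worst-scoring candidates and outputting only the portion common to all survivors can be illustrated in isolation. This is not the ZIP algorithm itself: the rule of keeping a fixed number of best candidates, the function names, and the candidate data are all assumptions made for the sake of a small example.

```python
def prune(candidates, keep):
    """Keep the `keep` candidates with the lowest (best) scores.

    Each candidate is a (score, path) pair; a fixed survivor count is an
    assumed pruning rule used only for illustration.
    """
    return sorted(candidates, key=lambda c: c[0])[:keep]


def common_origin_segment(paths):
    """Return the leading path elements shared by every surviving path."""
    shared = []
    for elements in zip(*paths):
        if all(e == elements[0] for e in elements):
            shared.append(elements[0])
        else:
            break   # the paths diverge here, so nothing further is certain
    return shared


# Hypothetical candidates after four reference frames have been processed.
candidates = [
    (3.0, [1, 2, 3, 4]),
    (9.0, [2, 2, 3, 5]),
    (4.0, [1, 2, 3, 5]),
    (3.5, [1, 2, 4, 5]),
]
survivors = prune(candidates, keep=3)
output_so_far = common_origin_segment([path for _, path in survivors])
```

In this example only the first two path elements can be output, because the survivors disagree from the third reference frame onward; further elements become available once later pruning removes the disagreement.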

The production of the time warping paths in the ZIP algorithm is efficiently performed by applying an algorithm similar to those frequently used to compute optimum scores in word recognition systems. A known word recognition system is described by J. S. Bridle, M. D. Brown and R. M. Chamberlain in an article entitled "A one-pass algorithm for connected word recognition" at pages 899 to 902 of the Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Paris, May 1982. However, unlike word recognition algorithms, the end product of ZIP is not the optimum score discovered along the optimum path but the optimum path itself. Consequently, ZIP is designed to process a number of paths starting from different origins in parallel, with each path produced describing the best sequence from the starting to the final point. To explain this processing a partial path score will now be discussed.

By a simple extension of the preceding definition for the optimum score S(K), it is possible to define an optimum partial path score, S_(p), for the path connecting any starting point (k_(s),j_(s)) to any end point (k_(e),j_(e)) (where k_(s) < k_(e) and j_(s) ≦ j_(e)) as the minimum possible sum of the distances d(k,j) for the range of k from k_(s) to k_(e) and the range of j from j_(s) to j_(e): i.e. ##EQU3##

The function of k that generates the sequence of j which minimizes this score and therefore describes a best partial path segment is dependent upon j_(s) and j_(e) and may be written as w_(js,je)(kT). It should be appreciated that for a given j_(s) and j_(e), only one sequence of j will describe the best path over a fixed range of k. That means there will be only one best path segment between any two points in the (k,j) plane. Moreover, w(kT)=w_(jo,jF)(kT).
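Written out from the verbal definition above, and assuming the notation j(k) for a candidate sequence of replacement frame indices (a notation introduced here for illustration only), the partial path score and the corresponding best segment may be sketched as follows. This is a reading of the prose and is not a transcription of the original equation.

```latex
S_p(k_s, j_s, k_e, j_e) \;=\;
  \min_{\substack{j(k_s) = j_s \\ j(k_e) = j_e}}
  \;\sum_{k = k_s}^{k_e} d\bigl(k,\, j(k)\bigr),
\qquad
w_{j_s, j_e}(kT) \;=\; \text{the sequence } j(k) \text{ attaining this minimum.}
```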

The search for paths which produce the minimum scores is carried out in ZIP via a Dynamic Programming (or recursive optimization) algorithm. In Dynamic Programming (DP) algorithms, two main principles are used in determining S(K) and hence w(kT): (1) the optimum set of values of j for the whole range of k from 1 to K is also optimum for any small part of the range of k; and (2) the optimum set of values of j corresponding to the values of k from k_(s) to any value k_(e) for which there is a corresponding j_(s) and j_(e) depends only on the values of j from j_(s) to j_(e).

Using these principles, ZIP generates values of the best partial score according to the following recursive DP equation: ##EQU4## in which a function P(a) is included so that the score will include a penalty for local timescale distortion. The above equation for S_(p), which is referred to hereinafter as a DP step, constrains the maximum path slope to be 2; thus the maximum replacement signal compression ratio will be 2:1.
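One form of the DP step that is consistent with this description may be written as below, assuming that the step variable a denotes the number of replacement frames advanced for one reference frame (a = 0, 1 or 2). The exact typeset form of the original equation is not reproduced here; this is a sketch under that assumption.

```latex
S_p(k_s, j_s, k_e, j_e) \;=\;
  \min_{a \in \{0, 1, 2\}}
  \Bigl[\, S_p(k_s, j_s, k_e - 1, j_e - a) + P(a) \,\Bigr]
  \;+\; d(k_e, j_e)
```

Because j_(e) can advance by at most 2 for each unit step in k under this form, the slope limit of 2 follows directly.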

The key aspect of the DP step is that the best step to a new end point at k=k_(e) is found by starting at the new end point and searching backwards to at most three previous best path ends at k=k_(e)-1 and connecting the new end point to the path which generates the best (i.e. lowest) score. This is illustrated in FIG. 8 which depicts the allowed paths in the (k,j) plane to a point (k,j) in the DP step. In particular, if a=0, signifying that a replacement frame is repeated (i.e. a horizontal step in the (k,j) plane), or if a=2, signifying that a single replacement frame has been skipped (i.e. a lower diagonal step in the (k,j) plane), different (positive) penalties are included. For a=1 (i.e. a diagonal step in the (k,j) plane) no penalty needs to be included.

Since there is no formal restriction on the amount of expansion by repetition that the path can introduce, the penalty for a=0 is generally set higher than that for a=2.
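A minimal sketch of one DP step over a window of path ends is given below. The function name `dp_step`, the zero-based indexing, the use of `math.inf` for missing path ends and the tie-breaking order are assumptions made for illustration; the penalty values in the usage note are arbitrary and merely respect the rule that the penalty for a=0 exceeds that for a=2.

```python
import math

def dp_step(prev_score, dist, penalty):
    """Extend every path end by one reference frame (illustrative sketch).

    prev_score[j]: best partial score ending at (k-1, j); math.inf if none.
    dist[j]:       distance d(k, j) for the current reference frame k.
    penalty[a]:    P(a) for a = 0 (horizontal), 1 (diagonal), 2 (lower diagonal).
    Returns (new_score, back), where back[j] is the previous j chosen for
    the best step into j, or -1 if no step was possible.
    """
    new_score = [math.inf] * len(dist)
    back = [-1] * len(dist)
    for j in range(len(dist)):
        # Diagonal is tried first so that it wins ties (an assumption).
        for a in (1, 0, 2):
            jp = j - a
            # Steps that would start outside the previous window are not tested.
            if jp < 0 or jp >= len(prev_score):
                continue
            cand = prev_score[jp] + penalty[a] + dist[j]
            if cand < new_score[j]:
                new_score[j] = cand
                back[j] = jp
    return new_score, back
```

For example, with three previous path ends all scored 0.0, five new distances all equal to 1.0, and penalties {0: 2.0, 1: 0.0, 2: 1.0}, the lowest end point can only be reached horizontally and the highest only by the lower diagonal step.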

The basic means by which ZIP examines a number of path ends in parallel will now be described with reference to FIG. 9. Some features of ZIP will be omitted here for simplicity. Initially, as illustrated at (a) in FIG. 9, L consecutive values of j_(s) from j_(s)=1 to j_(s)=L are taken at k_(s)=1 as the first elements of L different paths. Because this is the first step, these L consecutive values temporarily also define the end points of each path and can therefore be regarded as making up a window of elements for which certain data must be kept to compute the DP step. Several data arrays are used to hold the required data. First, for each new possible path end within the window, a corresponding path score will be kept in a data array named SCORE. The scores for the L different paths are all initially set to zero. Next, L+2 distances are independently computed between the reference vector at k=1 and the vectors of each of the first L+2 replacement frames from j=1 to j=L+2. These distances are held in a second data array named DIST. The two extra distance measures are made available to enable the DP step to extend the path ending at (1,L) along the lower diagonal in the (k,j) plane to (2,L+2). This action of extending the window by two units of j at each step of k steers the top of the path exploration window up (towards higher j) at a maximum slope of 2:1 as is illustrated by the graphical representations of the (k,j) plane at (a), (b) and (c) in FIG. 9.

At the bottom of the window, i.e. below j=1, the non-existence of path ends and scores means that the DP step is restricted to not test the a=1 or a=2 step when j=1 and, similarly, to not test the a=2 step when j=2.

Using the computed distances and the array of previous scores, ZIP computes a new best score independently for each of the L+2 new endpoints using the DP equation and, at the same time, saves in a two-dimensional array of path elements named PATH the corresponding index of j which provided each best step.

The index indicates the frame index from which the best step was made; therefore each index is actually a pointer to the previous frame's path end. Successive pointers generate a path which can be traced back to its origin. The PATH array holds a multiplicity of such strings of pointers.
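The pointer-following idea can be sketched as follows. The function name `trace_back` and the storage of pointers as a list of per-step columns are assumptions for illustration and do not reflect the actual layout of the PATH array; the first column is assumed to hold each element's own index, marking the path origins.

```python
def trace_back(pointer_columns, end_j):
    """Recover one path by following back-pointers (illustrative sketch).

    pointer_columns[t][j] is the index of j at step t-1 from which the best
    step into j at step t was made; column 0 holds each j's own index.
    Returns the sequence of j values along the path, oldest first.
    """
    j = end_j
    path = []
    for column in reversed(pointer_columns):
        path.append(j)      # record the path element at this step
        j = column[j]       # follow the pointer to the previous path end
    path.reverse()
    return path
```

Starting from any surviving path end in the newest column, the call returns the string of replacement frame indices back to the origin of that path.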

After the first DP step, the first column in PATH is simply filled with the indices of j from 1 to L+2. This is illustrated in FIG. 9 where the portion of the (k,j) plane is shown at (a), (b) and (c) with broken lines indicating an imaginary window around the endpoints of the path elements in the previous step. The SCORE, DIST and PATH arrays are shown at (a), (b) and (c) with typical data held after the DP step has been made for k=1, 2 and 3 respectively.

Each element in the SCORE array corresponds to a unique path end and sequence of previous path elements which led to that score. Each unique path is held as a row in the PATH array with the same index as the corresponding score in the SCORE array.

With reference to FIG. 9 again, the following cycle of processes is carried out. After the L+2 DP steps have been made and the new path ends have been saved in PATH, ZIP advances k to the next reference frame; computes a new set of distances between the new reference vector and each of the vectors of the replacement frames that will be needed in the DP steps; extends all the paths using the DP step equation, the array of distances and the array of previous scores; and thereby generates a new set of scores and the next path end elements corresponding to the new best scores. These path ends are appended to the appropriate path element sequences in PATH. This cycle, with the addition of some further processing to be described next, is repeated (as shown at (b) and (c) in FIG. 9) until the last reference frame is processed.

The choice of local path constraints in the DP step ensures that if the steps are computed by starting from the newest entries in SCORE and working backwards to the oldest entries, the paths cannot cross each other. They can, however, trace back to a common segment, as will be described hereinafter.

Without further processing, each path would grow in length by one unit for each DP step, and the number of paths, scores and distances would grow by two for each step, requiring a continually increasing amount of storage and computation which would be impractical for long signals.

ZIP avoids these problems by three different mechanisms:

A pruning technique effectively restricts the window dimensions by controlling both its top and bottom ends. For each reference frame, after the new set of scores and path endings are computed via the DP steps, all scores which are more than a predetermined amount (the threshold amount) away from the best score for that reference frame are omitted from further consideration. In addition, the path corresponding to each score thus pruned is also removed, and flags are set to prevent unusable distance measures from being calculated in the next DP step. As long as the difference between the score along the true optimum path and the score of the currently optimum path (i.e. the path with the best score ending at the current frame) remains less than the threshold amount, the optimum path is never pruned. During this pruning computation the computed best score found for each input frame is set equal to the negative of the threshold value and the remaining scores are computed relative to this one, so that the range of possible scores is reduced considerably.
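The threshold pruning and score rebasing just described can be sketched as follows. The function name `prune`, the use of `math.inf` to mark pruned scores and the boolean list standing in for the flags are assumptions for illustration.

```python
import math

def prune(scores, threshold):
    """Threshold pruning with score rebasing (illustrative sketch).

    A score survives if it lies within `threshold` of the best (lowest)
    score for this reference frame. Survivors are re-expressed relative to
    the best score, which itself becomes -threshold; pruned entries are
    marked with math.inf. Returns (rebased_scores, active_flags).
    """
    best = min(scores)
    rebased, active = [], []
    for s in scores:
        keep = (s - best) <= threshold
        active.append(keep)
        rebased.append(s - best - threshold if keep else math.inf)
    return rebased, active
```

After rebasing, every surviving score lies between the negative of the threshold and zero, which is what keeps the numeric range of the stored scores small regardless of signal length.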

The possible maximum length of the paths is restricted to some relatively small number (e.g. 50) by maintaining sufficient storage to hold paths for as many reference frames as needed to allow the pruning to establish agreement between, i.e. convergence of, the starting elements of the remaining path segments on the optimum path for one or more path elements. The elements common to the remaining paths may then be output as w(kT) and the storage units in PATH which held these values can be released for further use.
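The convergence test can be illustrated by finding the leading run of elements shared by all surviving paths. The function name `common_origin_segment` and the representation of each path as a list of j values (oldest first, all covering the same reference frames) are assumptions made for this sketch.

```python
def common_origin_segment(paths):
    """Return the leading elements shared by every surviving path (sketch).

    paths: list of equal-length lists of j values, oldest element first.
    The returned run is the part that could be output as w(kT), after
    which the corresponding storage could be released.
    """
    if not paths:
        return []
    shared = []
    for elements in zip(*paths):
        if all(e == elements[0] for e in elements):
            shared.append(elements[0])
        else:
            break  # the paths diverge from this reference frame onwards
    return shared
```

When only one path survives, the whole of it is common by definition and is returned in full.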

The third mechanism to reduce storage is implementation of the score and distance arrays as circular (or "ring") storage areas. The two-dimensional path array is implemented to be circular in each of its two dimensions, and acts as a two-dimensional window which moves over the (k,j) plane substantially diagonally, containing the path segments under scrutiny, among which is the optimal one.
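A one-dimensional circular store of the kind described can be sketched as below. The class name `Ring` and its interface are hypothetical; the sketch only shows how an ever-increasing frame index can address a fixed number of slots by taking the index modulo the store size.

```python
class Ring:
    """Fixed-size circular store addressed by a growing frame index (sketch)."""

    def __init__(self, size, fill=None):
        self.size = size
        self.data = [fill] * size

    def __setitem__(self, index, value):
        # Older entries are overwritten once the index wraps around.
        self.data[index % self.size] = value

    def __getitem__(self, index):
        return self.data[index % self.size]
```

A store of four slots written with frame indices 0 to 5 ends up holding frames 4, 5, 2 and 3; a slot may therefore only be reused once the frame it held is no longer a path candidate.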

However, the recording conditions for film guide track signals are usually considerably different (e.g. noisy, reverberant, distant microphone placement) from those for a studio-recorded dub. Procedures used to find the distances between the reference and replacement vectors must therefore minimize the effects of these long-term signal differences, but ZIP does not ensure this. Furthermore, the time warping path slope constraint in ZIP restricts the maximum compression of the replacement signal to a ratio of 2:1, which can cause the computed best path to omit replacement frames in a segment of the replacement signal containing speech if this segment follows silence whose duration is more than twice that of a corresponding silence in the reference signal. The desired algorithm response is to allow silence in the replacement signal to be expanded or compressed with far fewer restrictions than speech.

These shortcomings are overcome in the preferred embodiment of the present invention by modifying the ZIP algorithm. The modifications rely upon three assumptions concerning the nature of the guide track and dub speech signals: (1) that in the first few seconds of input there are some frames in both signals in which speech is not present, so that, since parameter vectors represent spectral cross sections, the lowest output values from each filter band are produced from samples of the background noise; (2) that the guide track and dub signals (in conditions of signal-to-noise ratios in excess of 20 dB) nominally contain similar speech sounds, so that maximum levels reached in corresponding frequency bands should correspond to roughly the same speech sounds and should consequently provide reference levels for normalizing the spectral levels of these bands; and (3) that the dub signal is input under nearly ideal (i.e. high signal-to-noise ratio) conditions, so that it is easy to detect whether a dub frame contains speech or background noise, whereas in contrast, the guide track signal may be heavily degraded by noise and unwanted signals.

The modified ZIP algorithm used in the preferred embodiment generates the time warping path by processing the parameter vectors on a continuous basis in three stages of processing. The first stage is an initialisation process which must be performed at least once. The main time warping path generation takes place in second and third stages.

In the first stage, illustrated in block form in FIG. 10, a large number of frames of both guide track and dub parameter vectors occupying 2 to 3 seconds, i.e. 200 to 300 frames, are analysed to produce estimates of long-term signal characteristics needed in the second and third stages. This long-term data is produced for each component of the parameter vectors. The first stage, which is in effect a first processing pass over some of the data, must be performed once before the main processing begins. In addition, it may be performed relatively infrequently (for example, once in every two or more seconds or in response to detected changes in signal characteristics) in order to update the long-term quantities.

In the second stage, illustrated in block form in FIG. 11, the dub parameter vectors are processed on a frame-by-frame basis (unlike in the first processing stage) in several different operations which utilize the first stage long-term data to: (a) classify the dub frames as containing speech or silence; and (b) carry out some of the processing which removes long-term spectral differences between the corresponding guide and dub bands, and equalises the usable (i.e. noise free) dynamic ranges. In addition, a number of working arrays for data are loaded with time-varying data related to the dub frames in readiness for use in the third stage. This time-varying data varies according to whether the respective dub frame classification is speech or silence and includes: (a) the preprocessed parameter vectors, which are resampled at twice the period of the original rate where successive dub frames are classified as silence; (b) the corresponding dub frame index numbers; (c) classification (speech/silence) indicators; and (d) the two penalties to be used in the Dynamic Programming pattern matching step.

In the third stage (illustrated in block form in FIG. 12), which is also performed once for each frame, an algorithm processes the data produced in the second stage and computes a number of potentially optimum time warping path segments for aligning the dub frames to those of the guide track. In further processing, the algorithm saves a limited number of the computed best of the paths and then, when these remaining path segments satisfy certain conditions (related to the uniqueness of their origins), the algorithm outputs a unique path segment which represents (when speech is present in the dub) the optimum path for time alignment. Alternatively, when silence is present in the dub for relatively large periods, a path is generated in such a way that dub silence may be compressed at a maximum rate of 4:1 by omitting frames, or extended indefinitely by repeating frames in the search for a best match of dub speech to the guide track signal.

Details of the First Stage

As indicated in detail in FIG. 10, the first stage provides a variety of non-time-varying data to be used in both the distance computation and the classification of dub frames as speech or silence. Firstly, in order to remove differences between the guide and dub filterbank outputs that are attributable to differences in recording conditions, linear gain adjustments, and background noise spectra and which therefore are not related to differences in the speech spectra alone, a normalization of spectral levels and dynamic ranges is provided. In the present embodiment, this normalization is implemented by producing a lookup table for mapping each frequency band output range of the guide to that of the corresponding dub band. Secondly, a noise floor lower limit is set for each dub band. Thirdly, since in measuring differences between two spectra those differences occurring in the vicinity of spectral peaks should be emphasized and less emphasis placed on spectral differences at low levels, a table of weighting function values (to be accessed in the third stage) is prepared for each band. The input to this table will be the maximum of the guide or dub spectral level, and its output will be the appropriate value to use in the spectral difference weighting function. These preceding procedures are related to those outlined in the paper entitled "A Digital Filter Bank for Spectral Matching" by D. H. Klatt in the Proceedings of the International Conference on Acoustics Speech and Signal Processing, at pages 573-576, published in 1976.

The input value A_(i)(kT) of a guide parameter vector component (i.e. a log-coded bandpass output) in one frame will now be referred to as g_in and similarly, a dub input component B_(i)(jT) as d_in. A specific band and frame is implied by g_in and by d_in. To accomplish the first stage processing, the following processing steps are taken for each frequency band in the dub and guide track separately (unless stated otherwise):

1. Using the first 200 frames of g_in, make a histogram (see FIG. 13 at (a)) in 1 dB-wide bins over the input range from 1 to 100 dB of the number of occurrences at a particular input level versus the input level (Blocks 63 and 64 in FIG. 10). Similarly, make a histogram of the same number of frames of d_in. (Blocks 65 and 66, FIG. 10).

2. Find the lowest bin (i.e. lowest input level in the histogram) which contains more than one entry and which is also not more than 6 dB below the next highest bin containing more than one entry. Identify this lowest bin as low_min.

3. Find the noise floor peak in the histogram by searching incrementally between low_min and low_min+15 (dB) for the histogram bin at which the sum of the contents of the test bin and the two adjacent (upper and lower neighbour) bins is a maximum. Identify the bin at which this maximum first occurs as low_peak. This value is used in steps 4 and 6 below.

4. For the dub only, set a speech/silence threshold value at low_peak+12 (dB). This value is referred to as d_sp_thr and is used in the third stage. (See Block 74, FIG. 10).

5. Determine an average histogram maximum value by the following procedure:

(a) Starting from the highest bin (100 dB), search down towards the lowest bin for the first (i.e. highest) bin in which there are at least three entries or for the first bin in which there is at least one entry and within 3 dB below this bin there is another bin with at least one entry. Mark the highest bin meeting this criterion as high_max.

(b) Beginning at high_max, sum the contents of this bin and successively lower bins until 5% or more of the histogram distribution has been accumulated (e.g. 10 entries if 200 frames are being processed). This corresponds to 5% of the total histogram area. Mark the bin at which this condition is met or surpassed as high_min.

(c) Subtract from high_max the greatest integer part of (high_max-high_min+1)/2 to obtain the bin value which will be marked as g_high_avg for the guide track band and d_high_avg for the dub. The respective values should mainly be related to the highest (but not necessarily peak) histogram values for the bands and should not be strongly affected by a small number of brief impulses that are higher than speech signal values. These values are used in steps 6 and 7.

6. Create a lookup table for use in the third stage that maps the guide track input range of values to a new set of values such that the long-term spectral differences between the dub and guide are removed and such that any input value falling below a computed noise floor common to both the guide and dub does not contribute unreliable information to the spectral distance calculation. This latter aspect removes the risk of obtaining an unwanted large dissimilarity score between a speech spectral cross section that is noise-free and a spectral cross section of the identical speech signal that is `noise-masked` (i.e. corrupted by additive noise with spectral density higher than that of some of the corresponding speech bands). Table values are calculated by generating a function of the guide input values according to the following steps:

(a) Set a noise floor level at 4 dB above the value low_peak in this band. Set g_nflr to this value for the guide band and similarly set d_nflr to the corresponding value for the dub band. (See blocks 67 and 68 in FIG. 10).

(b) Compute a band dynamic range by subtracting the appropriate (dub or guide) noise floor level from the corresponding value of high_avg. Set g_range to the value for the guide track and d_range to the value for the dub. (See blocks 69 and 70 in FIG. 10).

(c) If g_range is less than d_range, then compute a new mapped dub noise floor level, map_d_nflr, equal to d_high_avg - g_range. If g_range is greater than or equal to d_range, set map_d_nflr equal to d_nflr and set g_nflr equal to g_high_avg - d_range. (See block 71, FIG. 10). The variable map_d_nflr is used in the second stage as a lower limit on input dub values.

(d) Compute entries for the table that converts raw guide track values, now referred to as g_in, into output values according to the following function: ##EQU5## The expression (map_d_nflr-g_nflr) provides a constant range offset to compensate for the differences found between the top levels of the dub and guide signal ranges. (See block 72, FIG. 10).

7. Create a further lookup table for use in the third stage that maps input values of the normalized dynamic range found in step 6c into values v (where v=0, 1, 2 or 3) for use in weighting the spectral distance measures that will be computed in the third stage. In the third stage the weighting function is implemented by multiplying the raw spectral difference in one band by a function 2^(v(l)), where l is the input to the table found by taking the maximum of d_in and the mapped g_in. (See block 73, FIG. 10). The steps used to create the table of v(l) are as follows:

(a) Divide the minimum of g_range and d_range by n_div, which is the number of range divisions, and take the greatest integer value part of the result as the division increment, div_inc;

(b) For input values of l from 1 to 100, compute entries for the table of v(l) according to the following function: ##EQU6## The above procedure divides the common dynamic range into n_div steps, and input values above and below this common range are mapped to the highest and lowest values of v, respectively. To obtain a greater (or lesser) range of weights, n_div may be increased (or decreased) and a function similar to that above may be used to obtain the new v(l).
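Steps 1 to 5 and 6(a) to 6(c) above can be sketched for a single band as follows. All function names are hypothetical, input levels are assumed to be given in dB, and the fallback used when no qualifying bin exists in step 2 is an assumption. The table functions of steps 6(d) and 7(b) are not sketched because their equations are not reproduced in this text.

```python
def histogram(levels, lo=1, hi=100):
    """Step 1: count occurrences per 1 dB bin; counts[b] is the count at b dB."""
    counts = [0] * (hi + 2)  # slots 0 and hi+1 stay empty for neighbour sums
    for v in levels:
        counts[min(max(int(round(v)), lo), hi)] += 1
    return counts

def find_low_min(counts, lo=1, hi=100):
    """Step 2: lowest bin with >1 entry lying within 6 dB of the next such bin."""
    populated = [b for b in range(lo, hi + 1) if counts[b] > 1]
    for b, nxt in zip(populated, populated[1:]):
        if nxt - b <= 6:
            return b
    return populated[0] if populated else lo  # fallback is an assumption

def find_low_peak(counts, low_min, hi=100):
    """Step 3: first bin in [low_min, low_min+15] maximising the 3-bin sum."""
    best_bin, best_sum = low_min, -1
    for b in range(low_min, min(low_min + 15, hi) + 1):
        s = counts[b - 1] + counts[b] + counts[b + 1]
        if s > best_sum:
            best_bin, best_sum = b, s
    return best_bin

def find_high_avg(counts, lo=1, hi=100):
    """Step 5: average histogram maximum, insensitive to isolated impulses."""
    total = sum(counts)
    high_max = lo
    for b in range(hi, lo - 1, -1):
        near = any(counts[c] >= 1 for c in range(max(b - 3, lo), b))
        if counts[b] >= 3 or (counts[b] >= 1 and near):
            high_max = b
            break
    acc, high_min = 0, high_max
    for b in range(high_max, lo - 1, -1):
        acc += counts[b]
        high_min = b
        if acc * 20 >= total:  # 5% or more of the distribution accumulated
            break
    return high_max - (high_max - high_min + 1) // 2

def band_levels(g_levels, d_levels):
    """Steps 4 and 6(a)-(c): threshold and noise floors for one band."""
    g, d = histogram(g_levels), histogram(d_levels)
    g_peak = find_low_peak(g, find_low_min(g))
    d_peak = find_low_peak(d, find_low_min(d))
    g_nflr, d_nflr = g_peak + 4, d_peak + 4
    g_high, d_high = find_high_avg(g), find_high_avg(d)
    g_range, d_range = g_high - g_nflr, d_high - d_nflr
    if g_range < d_range:
        map_d_nflr = d_high - g_range
    else:
        map_d_nflr = d_nflr
        g_nflr = g_high - d_range
    return {"d_sp_thr": d_peak + 12, "g_nflr": g_nflr,
            "map_d_nflr": map_d_nflr,
            "g_high_avg": g_high, "d_high_avg": d_high}
```

In either branch of step 6(c), the distance from the adjusted noise floor to the top level becomes the same for guide and dub, which is what equalises the two dynamic ranges.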

The second and third stages of the Time Warping Processor (TWP) generating algorithm will be described next. Some of the most important variable and array definitions are listed now.

    ______________________________________
    Variable           Definition
    ______________________________________
    DSF                Dub start frame (number): used at start of second
                       stage.
    DSTOPF             Dub stop frame number.
    NWDF               Number of working dub frames: defines the number
                       of slots of dub frame data held in each
                       dub-related array.
    NDFR               Current number of dub frames read in and processed
                       so far in second stage. Also indicates the number
                       j of the dub frame being processed in the second
                       stage.
    GSF                Guide track start frame number. (=1).
    GSTOPF             Guide stop frame: initiates shutdown of TWP
                       activity.
    NCGF               Number of current guide frame being processed.
    HPENSI             Horizontal DP step penalty for dub frames classed
                       as silence.
    HPENSP             Horizontal DP step penalty for dub frames classed
                       as speech.
    LDPNSI             Lower Diagonal DP step penalty for dub silent
                       frames.
    LDPNSP             Lower Diagonal DP step penalty for dub speech
                       frames.
    TH                 Threshold used in pruning DP scores.
    MAXRPT             Maximum number of frames of horizontal path growth
                       allowed before silence pruning is attempted.
    PE                 Path end column in path array.
    PSTART             Path start column in path array.
    ______________________________________
    Array Dimension    Definition
    ______________________________________
    MNDF               Maximum number of dub frames held in arrays.
                       Typically MNDF = 50
    NPAR               Number of parameter vector elements used.
    MXPATH             Maximum length of path segment held in path array.
    ______________________________________
    Array              Definition
    ______________________________________
    DCLASS(MNDF)       Dub classifications (speech or silence).
    DFRNUM(MNDF)       Dub frame numbers corresponding to j's.
    DIST(MNDF)         Spectral distances between each dub frame
                       parameter vector in DSTORE and current guide
                       parameter vector.
    DSTORE(NPAR,MNDF)  Dub parameter vector working store holding NPAR
                       elements per dub frame.
    HPEN(MNDF)         Horizontal penalties to be used in DP steps.
    LDPEN(MNDF)        Lower diagonal penalties to be used in DP steps.
    HSU(MNDF)          Horizontal DP step-used-in-speech flags.
    PATH(MXPATH,MNDF)  Best partial path up to each end point.
    SCORE(MNDF)        Accumulated score for each partial path.
    ______________________________________

In FIG. 14 the activities of the three processing stages are illustrated in relation to each other in a flow diagram of the entire time warping process, in which the first, second and third stages I, II, and III are represented by blocks 75, 76, 77 and 78. Before FIG. 14 is described, the method of processing guide track and dub filter bank outputs is explained. In the following explanation it should be noted that guide and dub filterbank output values are readily and continuously available from a buffer memory, and that at the end of the guide signal parameterization, the variable GSTOPF will be set to the last guide frame number. The signal which initiates the setting of GSTOPF is derived by means discussed later. Before the algorithm is started, GSTOPF is initialized to some arbitrarily large value never to be reached in operation. In addition, to enable the system to handle properly a replacement signal whose duration extends beyond that of the reference signal, the parameterization and storage of the dub signal should continue for a duration sufficiently long to contain a signal ending which substantially resembles the (possibly earlier) signal ending in the reference signal. This safety measure can be accomplished for example by deriving a further variable, DSTOPF, by adding a fixed number of frames (e.g. 200 or two seconds of frames) to GSTOPF when GSTOPF becomes known, and then allowing the dub processing to continue up to the moment in time corresponding to the end of this frame. The variable GSTOPF is used to end processing activity of the second and third stages II and III, whereas DSTOPF is used to terminate the input and parameterizing of the replacement signal, and to mark the end of available replacement data during the processing.

The use of circular arrays is implied in all further discussions, but this is not necessary for very short signals.

Before any of the processing represented by FIG. 14 begins, the user may select (or adjust) the values of the DP step penalties (HPENSI, HPENSP, LDPNSI, LDPNSP), the pruning threshold (TH), and dub silence frame repeat count threshold (MAXRPT). These values are generally determined experimentally and are dependent on the output range of the parameter vector generating processes and frame rates.

At a given signal (generated upon loop entry), the parameter generator processor is started (block 79). Once a sufficient number of raw guide and dub parameter vectors are available (decision 80), STAGE I (block 75) is enabled and produces the threshold variables, and mapping and weighting function arrays described hereinbefore. STAGE II (block 76) is then used to preload the arrays as shown in FIG. 11 up to their maximum length or to the last dub frame, whichever is smaller. Next, STAGE III is initialized at A by resetting all relevant counters and clearing or setting array elements. Finally the main processing loop is entered and repeated for each guide frame. In each pass through this loop a STAGE II load (block 77) is attempted (but may not be made if the oldest slot in the dub arrays still contains a potential path candidate). Also in this loop, STAGE III processing (block 78) takes place in which parallel DP steps are made for each active path, and also an attempt is made to output a unique best path or a segment of silence. When the last guide frame is processed, the remaining path segment with the best score is output, and the time warping process is finished.

The second stage of the time warping process is represented in detail in block form in FIG. 11 and in a flow diagram in FIG. 15. This stage pre-processes the dub filterbank outputs and loads time-varying data into arrays for use in the DP step which takes place in the third stage. Decisions and processing affecting how the data is prepared are partly based on some of the long-term data derived in the first stage.

The relationships between the input dub filterbank data and the data loaded into the arrays DSTORE, DCLASS, LDPEN, HPEN, and DFRNUM are shown functionally in FIG. 11. The arrays (of dimension NWDF) are treated circularly and are loaded at the same array row index once for each dub frame classified as speech, or once every other frame when consecutive dub frames are classified as silence. The classification of the dub frame (taking place in the block 79 inscribed CLASSIFY: SPEECH/SILENCE) is based upon a simple decision algorithm whereby if any two out of the four raw input dub bands are above the respective thresholds for those bands (set in the first stage, i.e. d_sp_thr), the frame is classified as containing speech. Otherwise it is classified as silence. In the block 80 inscribed CLIP LOWER RANGE, each band of the raw dub filterbank values is compared with the corresponding mapped noise floor (map_d_nflr determined in the first stage) for that band. If the raw value of the band falls below the map_d_nflr of the band, the raw input value is replaced by map_d_nflr, which is loaded into the appropriate slot in DSTORE. Any dub band value above the corresponding map_d_nflr is loaded without modification into DSTORE. This step is part of the total operation which eliminates the possibility of noise masking, and equalises the guide and dub band dynamic ranges.
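The classification and clipping operations just described can be sketched as follows. The function names and the string labels "speech" and "silence" are assumptions for illustration; the per-band thresholds and noise floors are taken to be the first stage values d_sp_thr and map_d_nflr.

```python
def classify_frame(raw_bands, d_sp_thr, min_bands=2):
    """Label a dub frame as speech if at least `min_bands` raw band values
    exceed their respective speech/silence thresholds (illustrative sketch)."""
    above = sum(1 for value, thr in zip(raw_bands, d_sp_thr) if value > thr)
    return "speech" if above >= min_bands else "silence"

def clip_lower_range(raw_bands, map_d_nflr):
    """Replace any band value below its mapped noise floor by that floor;
    values above the floor pass through unchanged."""
    return [max(value, floor) for value, floor in zip(raw_bands, map_d_nflr)]
```

Classification operates on the raw band values, so clipping does not affect the speech/silence decision; only the clipped values are stored for the later distance computation.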

In a block 81 inscribed SELECT LD-PENALTY and HZ-PENALTY, the user-input values for the penalties to be added for non-diagonal DP steps (in the third stage) are selected, based upon whether the corresponding frame is speech or silence. By using very small penalties for silence frames as compared with the penalties for speech frames, the path will be much more flexible during dub silence, which is a desirable effect. The lower diagonal penalty is made slightly negative so that best paths in dub silence can be biased towards a slope of 4:1 during low level guide signals, which is useful for compressing long gaps in the dub when necessary.

Another block 82 inscribed INCREMENT DUB FRAME COUNT is shown which produces the appropriate frame numbers to be loaded into the array DFRNUM for later use in producing the correct time warping path steps in the third stage.

Finally, a block 83 inscribed SELECT SAMPLING RATE increases the sampling rate of the dub frame data (via a block 84 inscribed SAMPLE AND INCREMENT INDEX) when the current and previous dub frames are classified as silence. Otherwise the sampling rate remains 1:1. The particular algorithms used to implement these functional blocks are illustrated in the flow diagram of FIG. 15 and include decisions 91, 92, 93 and 94 operating on dub class DCL, next dub class NXTCLS, and previous dub class PRVCLS. Before this stage is used, the variable NXTCLS is initialised to UNKNOWN, and PRVCLS to SPEECH.

Details of the Third Stage

In the third stage of the time warping process, a Dynamic Programming (DP) algorithm is used with window steering and path pruning based on that of the ZIP algorithm, along with an added horizontal path step restriction and a silence pruning operation, to produce a best time warping path and corresponding frame classifications for input to the signal editing process. FIG. 12 illustrates the major processing operations and their relationship to the data structures defined previously. FIG. 16 summarizes the primary operations in flow diagram form. These operations are performed sequentially, and begin (see FIG. 14) after the required number of dub frames have been processed in the first and second stages.

During the second stage, the array DSTORE is filled with processed dub parameter vectors that may have been reduced in their dynamic range by the range normalization operation in the second stage. The dub parameter vectors in DSTORE are not necessarily strictly consecutive, owing to the possibility that the sampling rate may have been increased. However, for each dub frame parameter vector in DSTORE the appropriate penalties to be used in the DP step and the classification and dub frame number to be used in updating the paths are held in the arrays LDPEN, HPEN, DCLASS, and DFRNUM respectively. All elements of the PATH array are generally initialized to 0, and the upper half of the SCORE array is given a rejection code while the lower half is set to a zero score. The rejection code is used to identify elements which need not be further processed. Additionally, all elements of the array HSU are set to logical false.

The array HSU is used to introduce a restriction on the number of consecutive horizontal steps allowed along any path with frames classified as speech. Referring to FIG. 8 and the DP step equation, the a=0 step is allowed to be used once only for any frame that is classified as speech. In this way a minimum path slope of 1/2 (i.e. an expansion factor of 2) is permitted during speech.

As illustrated in FIG. 12 and FIG. 16, the following operations are executed once for each pass through the processing loop shown in FIG. 14 (i.e. once per guide frame).

1. Update the path end pointer PE (block 95, FIG. 16).

2. Get the next raw guide parameter vector from the buffer and map each component through the corresponding g_to_d_maps. This is carried out in a block 85 inscribed RANGE NORMALIZE AND LIMIT.

3. Compute the weighted spectral distance measure between the normalized guide frame parameter vector and each dub frame parameter vector in DSTORE that is required in the exploration window in the next set of parallel DP steps. These distances are put into the corresponding slots in DIST. This operation takes place in the block 86 inscribed COMPUTE WEIGHTED SPECTRAL DISTANCE.

4. For each active score and path in the current search window, compute the DP step using horizontal step restrictions, penalties, scores and distances at the appropriate indices of the arrays HSU, LDPEN, HPEN, SCORE and DIST respectively, to find the path element producing the best score. Update the path end in the PATH array at PE with the corresponding dub frame numbers (from DFRNUM) and the SCORE array with the best scores. In addition, mark the path element with the classification of the dub frame. Set or clear any horizontal path restrictions as appropriate. These operations all take place in a process block 87 inscribed DP STEP.

5. Prune (i.e. reject) paths with scores more than the threshold value (TH) away from the best score in SCORE, and put a rejection code in each element of SCORE that has been pruned. The remaining (unrejected) scores define the search window that will be used to extend the paths in the next DP step. This operation takes place in a block 88 inscribed PRUNE BAD SCORES & CORRESPONDING PATHS.

6. If the paths remaining in PATH trace back to (i.e. agree on) a common path segment, output that path (and corresponding speech/silence markers in the path) up to the point of divergence of the path, and clear the common path elements from PATH. This takes place in the block 89 inscribed DETECT AND OUTPUT UNIQUE PATH ELEMENTS.

7. If the classified path segments remaining in PATH indicate that the exploration window has been passing through a region of dub silence and a relatively featureless region of the guide frames for more than MAXRPT frames, output the best scoring path (and corresponding classifications) up to the last element, remove all other paths, and restart the DP algorithm at the remaining path end element. This operation is carried out in the block 90 inscribed DETECT AND OUTPUT PATH SEGMENT IN DUB SILENCE.

8. If the last guide frame has been processed (indicated by GSTOPF), find the remaining path segment with the best score and output it. (This step is not shown in FIG. 12). This action terminates the time warping process.
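Operation 5 in the list above (fixed-threshold pruning) can be sketched as follows. This is a minimal sketch under stated assumptions: scores are taken to be accumulated distances, so the best score is the lowest one, and `None` stands in for the rejection code. Neither choice is specified in the text, and the function name is hypothetical.

```python
from typing import List, Optional

REJECT = None  # assumed stand-in for the rejection code


def prune_scores(score: List[Optional[float]], th: float) -> List[Optional[float]]:
    """Reject every active score that lies more than `th` away from the best score."""
    active = [s for s in score if s is not REJECT]
    if not active:
        return list(score)
    best = min(active)  # assumption: a lower accumulated score is better
    return [
        s if (s is not REJECT and s - best <= th) else REJECT
        for s in score
    ]


# Scores 3 and 5 survive a threshold of 4; the score of 10 is rejected.
print(prune_scores([3, REJECT, 10, 5], th=4))
```

The surviving entries mark the search window for the next DP step; rejected slots are skipped in later distance calculations.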

For the preceding operation number 3, the process for computing the weighted spectral distances, the spectral distance weighting factor (introduced previously) is defined in spectral band i as

    r_i(kT) = 2^{v_i(l_i)}

in guide frame k, where l_i is the maximum of the ith mapped guide band value and the ith normalized dub band value from DSTORE. The resultant value of l_i is used as an index to the array of weighting values v_i(l_i) for band i and a power-of-two weighting of the absolute values of the difference between the ith dub and guide bands is computed to obtain the contribution of the ith component to the total spectral distance.
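As an illustration of the weighting just defined, the following Python sketch sums the weighted band differences for one guide/dub frame pair. It assumes that band values are non-negative integers (so that they can index the weighting arrays) and that the total distance is the plain sum of the per-band contributions; the function name and the example tables are hypothetical.

```python
from typing import Sequence


def weighted_spectral_distance(
    guide: Sequence[int],
    dub: Sequence[int],
    v: Sequence[Sequence[int]],
) -> int:
    """Sum over bands of 2**v[i][l_i] * |guide_i - dub_i|, with l_i = max(guide_i, dub_i)."""
    total = 0
    for i, (g, d) in enumerate(zip(guide, dub)):
        l_i = max(g, d)            # level used to index the weighting array
        r_i = 2 ** v[i][l_i]       # power-of-two weighting factor for band i
        total += r_i * abs(g - d)
    return total


# Two bands with made-up weighting tables indexed by level 0..3.
v_tables = [[0, 0, 1, 2], [0, 1, 1, 1]]
print(weighted_spectral_distance([1, 2], [3, 2], v_tables))
```

A power-of-two weight lets the multiplication be carried out as a bit shift on integer hardware, which may be one reason for this form of weighting.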

The additional data path leading to this process block from the score array allows sensing of rejection codes marking elements that have been rejected or are not active in the current search window, so that unnecessary distance calculations can be prevented from being carried out.

The operation number 6 can be implemented simply as follows. First, the columns of PATH which contain the first and last elements of the remaining path segments must be located. Call the index of the column containing the start of the current path segments, PSTART, and the column which contains the end elements of the current path segments, PE. Given a total number of columns in the path array of MXPATH, employ the following algorithm, which is presented in a pseudo-programming language for convenience. Note: explanatory comments accompany some steps, and questions and their corresponding answers (i.e. Yes and No) are equally indented.

    ______________________________________
    i = PSTART                    set column pointer index i
    1 Is the same element in all remaining paths at i?
      Yes:                        Path is unique in this column.
        Output path element and classification.
        Mark all entries in column i with output/rejected code = 0.
        i = i + 1.
        If (i > MXPATH) set i = 1.
        If (i not equal PE) go to 1.
        Go to 2.
      No:                         Paths diverge in this column.
        Has anything been output (i not equal PSTART)?
        No:
          Is the path array full?
          Yes:
            Output the oldest path element with the best score.
            Remove paths disagreeing with element that was output.
            Put rejection code in score array for removed paths.
            i = i + 1.
            If (i > MXPATH) i = 1.
            Go to 2.
          No:
            Go to 2.
        Yes:
          Go to 2.
    2 PSTART = i                  Take current column (with possible path
      Return.                     divergence) as new PSTART for next pass.
    ______________________________________
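The core of this procedure, finding how far the remaining paths agree, can be illustrated with a much simplified Python sketch. It assumes each path is held as a plain list of dub frame numbers running from PSTART to PE, and it deliberately ignores the circular indexing and the full-array case handled by the pseudocode; the function name is hypothetical.

```python
from typing import List, Sequence


def common_prefix_length(paths: Sequence[Sequence[int]]) -> int:
    """Return the number of leading columns in which all paths hold the same element."""
    count = 0
    for column in zip(*paths):
        if any(element != column[0] for element in column):
            break  # paths diverge in this column
        count += 1
    return count


paths: List[List[int]] = [[1, 2, 3, 5], [1, 2, 4, 6], [1, 2, 3, 7]]
n = common_prefix_length(paths)
print(paths[0][:n])   # the unique segment that could be output
```

After output, the column index following the common segment would play the role of the new PSTART.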

The operation number 7 appears to be unique to this implementation and will now be described in some detail. The reason for including this operation arises from a consideration of the DP path production steps used, and will be explained with reference to FIG. 17, which is a schematic representation of typical contents of the path array after the DP algorithm has stepped through several frames of low level guide signal (at or near the guide noise floor) and the corresponding dub frames have been classified as silence. The fact that the guide frames are at low levels means that the spectral distance measures between the guide and silence dub frames will be very low or 0, and thus give rise to a lack of features in the incoming distance measures and scores which would otherwise normally provide sensible window steering.

The positions of the dub frames which are stored in DSTORE are indicated on the vertical axis of FIG. 17 by dots, and it is seen that dub frames at alternate j values are used during silence. The paths produced during the DP steps in silence generally have a slope of 4:1 due to the bias of the DP step towards the lower diagonal during frames of dub silence. However, during these steps, the scores for each path are either decreasing or increasing very little (because of the low penalties used), in order to allow silent regions to have very flexible paths. Consequently the scores of the worst scoring paths will only be increasing marginally and thus these paths will not generally be pruned by the fixed threshold pruning operation during dub silence. The number of paths will increase at a rate of two per guide frame and thus introduce a heavy and unnecessary computational burden unless removed. Accompanying this lack of pruning in dub silence are the facts that 1) the lowest path (e.g. from d to e in FIG. 17) usually has a growing number of repeated frames and 2) the fastest rising path (e.g. from a to c in FIG. 17) has a slope of nearly 4:1 for the section of the path corresponding to the repeated frames in the lowest path (i.e. from b to c in FIG. 17). These facts result in a triangular path beam characteristic of the shape of path exploration during dub silence with the classification-dependent DP algorithm implemented.

Because some of the penalties are negative, the best score does not necessarily indicate the optimal path but is likely to do so. Most importantly, the path taken through this region is generally arbitrary so long as the spectral distance measures do not indicate that speech has been encountered at points such as c or e in FIG. 17 which would be manifested in score changes sufficiently large to activate the pruning and window steering described previously.

Although it is not certain where the optimal path will be required to go in the next step (at PE+1) there is nonetheless a best choice of path to be made in view of the properties of the current DP algorithm. Generally, the best path to take is the one which has the best score. However if the procedure described hereinafter is implemented, the path with the best score will be the fastest rising path in most cases. From the example of FIG. 17 it can be seen that if the next guide frame to cause a path extension at PE+1 were speech, and if the next dub frame after c were the speech frame corresponding to the next guide frame, the highest path shown would have compressed a gap of silence nearly four times longer than that in the guide. Alternatively, if the dub and guide spectra continued to be featureless, there would be no loss in exploration ability from abandoning all paths but this highest one and restarting the DP algorithm from point c since the DP algorithm will continue to explore simultaneously paths which repeat the dub frame at c and paths rising at a rate of 4:1 from c. This procedure therefore can effectively find a path through any dub silence gap of t_(g) in duration and fit it to a corresponding gap in the guide track of any duration from t_(g) /4 to infinity.

The technique and algorithm used to detect and output dub silence in the conditions described above will now be described. Defining the number of repeated frames back from PE along the lowest path (not counting the first one as a repeat) as RPTCNT, then the maximum number of vertical dub frame steps that could be taken if the highest path were stepping through a region of dub silence is RPTCNT multiplied by 4. However, it is not expected that every step will necessarily be a 4:1 step, and it is better to define a rise of a threshold number of dub frame units of j based on an average slope less than 4:1 that allows a few smaller steps to be included in the fastest rising path and also still allows the maximum rise to be an indicator of a dub silence region. We have found that an average slope of 3.4:1 is a reasonable indicator that the path is rising through silence. The algorithm which follows is again described in a pseudo programming language.

Count the number of repeated elements in the lowest path in PATH back from PE.

Take this number as RPTCNT.

    ______________________________________
    Is (RPTCNT > MAXRPT) ?        Has a sufficiently long gap been explored?
    No:
      Return.
    Yes:
      Calculate a minimum number of frames (MNFRMS) that the path would
      rise in RPTCNT frames if the upper path was not finding any
      significant features.
      MNFRMS = 3.4 * RPTCNT.
      Calculate the actual span NSPAN in frames in the upper path between
      the dub frame number at PE and the dub frame at PE - RPTCNT.
      Is (NSPAN > MNFRMS) ?
      Yes:                        The area explored has been featureless.
        Find the best score and output the corresponding path up to but
        not including the element at PE.
        Clear all path elements but the end of the best path at PE.
        Put the rejection code in all SCORE elements but the best.
        Return.
      No:
        Return.
    ______________________________________
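The test in the pseudocode above can be written out as a short Python sketch. It assumes the lowest and highest paths are available as lists of dub frame numbers whose last element is the one at PE; the function names, and passing the 3.4 slope as a parameter, are illustrative choices rather than details from the text.

```python
from typing import Sequence


def count_repeats(lowest_path: Sequence[int]) -> int:
    """Count repeated dub frames back from PE, not counting the first occurrence."""
    count = 0
    for idx in range(len(lowest_path) - 1, 0, -1):
        if lowest_path[idx] != lowest_path[idx - 1]:
            break
        count += 1
    return count


def featureless_gap(rptcnt: int, upper_path: Sequence[int],
                    maxrpt: int, avg_slope: float = 3.4) -> bool:
    """True if the upper path has risen faster than avg_slope over the last rptcnt frames."""
    if rptcnt <= maxrpt:
        return False                     # gap not yet long enough to judge
    mnfrms = avg_slope * rptcnt          # MNFRMS
    nspan = upper_path[-1] - upper_path[-1 - rptcnt]   # NSPAN
    return nspan > mnfrms


lowest = [3, 5, 7, 7, 7, 7, 7, 7]
upper = [3, 5, 9, 13, 17, 21, 25, 29]
rptcnt = count_repeats(lowest)
print(rptcnt, featureless_gap(rptcnt, upper, maxrpt=4))
```

When the function returns true, the best-scoring path would be output up to (but not including) PE and the DP search restarted from its end element.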

DUB EDITING PROCESSOR

The purpose of the processing block 49 inscribed GENERATE EDITING DATA in FIG. 6 is to use the time warping path and corresponding speech/silence classifications of the path elements as gross instructions for editing the dub waveform which is stored on the disc, and to also derive exact editing instructions (when required) from pitch period data and dub waveform details. The final editing of the waveform is carried out in the process block 51 inscribed EDIT WAVEFORM, which simply fetches the signal in segments defined by the editing data, and pieces together an edited dub waveform which has the following properties: (1) for every frame of the time warping path, approximately a frame-length segment of the dub waveform in the region of time indicated by the warping path is output; (2) for each frame classified as silence in the warping path, a frame-length period of true (digital zero) silence may be output to replace the dub waveform; (3) deletions or repetitions of dub waveform frames (as specified by the time warping path) are carried out pitch-synchronously in voiced speech--that is, the deleted or repeated waveform segment is the integral number of pitch periods in length which best satisfies the requirements of the warp path and last output sample loaded; and (4) endpoints of any non-adjacent waveform segments that are joined together are matched to eliminate any perceived discontinuities.

Examples of the operations referred to hereinbefore at (1) and (2) in the preceding paragraph are represented in FIG. 18. For every guide frame k there is a dub frame j=w(kT). In FIG. 18 a path w(kT) is shown in the (k,j) plane as a series of connected dots which if open indicate that the dub frame has a silence classification and otherwise indicate a speech classification. The steps along the path are restricted (e.g. a speech frame can be repeated once only, and no more than one speech frame can be skipped in any one step); this simplifies the editing process considerably. Adjacent to the j axis a typical dub time waveform, x₂ (t'), is represented graphically with each dub frame number j aligned at the end of a frame period of T seconds, thereby fixing the correspondence of the waveform segments to the frame numbers. At points in the path w(kT) where frames of j are skipped, an "X" marks a waveform section for deletion. Similarly, double arrows mark a segment for repetition.

The dub waveform segments are projected down to the time axis t" adjacent to the k axis (as typified by the segment marked out by broken lines) to reconstruct graphically (ignoring any discontinuities) an edited x₂ (t'), which is labelled x₂ (t"), from the selected waveform segments and from digital silence (i.e. zeros). The discontinuities which would result from such a reconstruction would be perceptually unacceptable. Therefore, the following technique alleviates this problem and still maintains a close tracking of the time warping path as primary editing data.

The following quantities are defined for use in describing the editing process:

Constants:

SMPRAT--The sampling rate of the stored dub waveform.

LENFRM--The length of a frame of waveform in samples.

ETIS--The edit threshold in samples (=LENFRM/2)

Frame Rate Variables:

NG--(Current) guide frame number (corresponding to k).

ND--(Current) dub frame number (corresponding to j) obtained from warp path in frame NG.

DCL--Dub frame ND's classification.

PRVND--Previous dub frame number from warp path at NG-1.

PRVDCL--Previous dub frame PRVND's classification.

Sample Rate Variables:

TESIIW--Target end sample in input (unedited dub) waveform.

LESIIW--Load end sample in input waveform.

TESIOW--Target end sample in output (edited dub) waveform.

LESIOW--Load end sample in output waveform.

INCSMP--Increment in samples from previous to current input waveform targets.

DEV--Deviation in samples between the output waveform end sample and target end sample that will result if the next frame is loaded with length LENFRM after the current LESIOW.

The basic operations involved in editing are shown in the form of a flow diagram in FIG. 20(a), (b) and (c).

As seen from the example of FIG. 18, the time warping path w(kT) defines two sets of target endpoints in samples of waveform segments LENFRM=T*SMPRAT samples in length. (See also FIG. 20(a)). The first of these is the target endpoint sample number in the output (edited) waveform, where a segment at guide frame NG (=k) is to end. Thus, if signals begin at sample one, the kth frame number specifies that the end of the kth segment, LENFRM samples long, would be at sample number k*LENFRM in the output waveform. For a particular frame k, the target endpoint sample in output waveform is referred to as TESIOW. Similarly, the dub frame number ND=j, obtained from the warp path as j=w(kT), also specifies an input (unedited) waveform segment endpoint at sample number j*LENFRM. For a particular frame of j, the target end sample in input waveform is referred to as TESIIW.

If the editing process were to simply produce an output waveform as exemplified in FIG. 18, the difference (TESIIW-TESIOW) would be unlikely to equal 0 for any frame. Therefore the editing process is designed to attempt to fetch consecutive segments of the input waveform until the deviation between the actual endpoints and target endpoints in the output and input waveforms would become greater than some predefined threshold value. The editing process can then load segments which do not necessarily end on segment boundaries defined by the sequence of TESIIWs and concatenate these segments to form an output waveform in which the end samples in each load segment do not necessarily fall on segment boundaries defined by the sequence of TESIOWs. To compute this running deviation, two further variables must be introduced.

The first, LESIOW, of these further two variables refers to the actual last load end sample in output waveform, and is the sample number found at the end of the last loaded segment, counting from the first sample inclusive, of the output signal. Similarly, the second, LESIIW, refers to the load end sample in the input waveform and is the number of the sample last loaded into the output waveform signal buffer, counting from the first input sample inclusive.

With these four variables TESIOW, TESIIW, LESIOW and LESIIW it is possible to find the deviation from the "target" waveform defined by w(kT) that would exist after any input waveform segment is loaded into any location in the output waveform. This deviation, defined as DEV, is calculated as:

DEV=TESIIW-TESIOW+LESIOW-LESIIW, as indicated in block 96 of FIG. 20(b), and provides a number (in samples) which is positive if the last loaded waveform end sample is beyond its targeted position in the output buffer. Similarly, DEV is negative if the last loaded waveform end sample falls short of its targeted position in the output buffer. Given that the deviation can change at each k if w(k)≠w(k+1)-1 (i.e. whenever the path step is not a single dub frame), the output waveform is assembled frame by frame, and the deviation is computed before each new segment is loaded. If the magnitude of the deviation that would result from loading the next LENFRM samples after LESIIW into the position in the output waveform following LESIOW is greater than a maximum permissible deviation defined as ETIS (edit threshold in samples), an editing operation is applied, as illustrated by FIG. 20(c) following a YES answer to a decision 97 in FIG. 20(b).
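The deviation test can be illustrated with a short Python sketch. The values used (a frame length of 100 samples, guide frame 2 mapped to dub frame 3) are illustrative assumptions, and the function names are hypothetical; only the formula for DEV and the threshold ETIS=LENFRM/2 come from the text.

```python
LENFRM = 100          # assumed frame length in samples, for illustration only
ETIS = LENFRM // 2    # edit threshold in samples (= LENFRM/2)


def target_end_samples(k: int, j: int, lenfrm: int = LENFRM) -> tuple:
    """Return (TESIOW, TESIIW) for guide frame k and dub frame j = w(kT)."""
    return k * lenfrm, j * lenfrm


def deviation(tesiiw: int, tesiow: int, lesiow: int, lesiiw: int) -> int:
    """DEV = TESIIW - TESIOW + LESIOW - LESIIW, in samples."""
    return tesiiw - tesiow + lesiow - lesiiw


def edit_required(dev: int, etis: int = ETIS) -> bool:
    """An edit is applied when the magnitude of the deviation exceeds ETIS."""
    return abs(dev) > etis


# The warp path skips one dub frame: guide frame 2 maps to dub frame 3.
tesiow, tesiiw = target_end_samples(k=2, j=3)
# Load end samples that would result from loading the next LENFRM samples unedited.
dev = deviation(tesiiw, tesiow, lesiow=200, lesiiw=200)
print(dev, edit_required(dev))
```

With these numbers the unedited load would leave the output one full frame short of its input target, so an edit (here a skip) is called for.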

In segments of dub waveform classified as speech the editing operations must be done pitch synchronously if the segment is found to contain voiced speech, and the required operations are described below. With reference to the example in FIG. 19, the input waveform (unedited dub) shown at (a) represents periodic speech on an axis numbered in samples every LENFRM=100. In FIG. 19 at (b) the target end samples are shown, and a typical skip of 100 to 300 is indicated for TESIIW, whereas TESIOW does not (and cannot) make this jump. If the deviation for the first load is tested using LESIIW=100 and LESIOW=100, then DEV=0. Therefore, no editing is required and this segment is loaded into the output buffer as shown at (c) in FIG. 19. However, in the second frame, if a load were made with LESIIW=200, then with TESIIW=300, TESIOW=200 and LESIOW=200, DEV=100, which indicates a skip must be made to reduce DEV below the threshold of ETIS=50.

The general procedure taken to make this edit is as follows:

(1) The next three frames following the current LESIIW (at sample q in (a) of FIG. 19) are loaded into the output buffer after LESIOW (at q') for examination. (See block 98, FIG. 20(c)). This extra segment in the example is from point s to point u in the input buffer and is shown loaded in the output buffer from s' to u'.

(2) The period of the waveform over the current and next frame is measured using the waveform in the output buffer, and the result (in samples) is assigned to the variable PERIOD. (See block 99 in FIG. 20(c)). The computational method used to find the period is that of the Average Magnitude Difference Function (or AMDF), which is described in detail along with several other equally useful techniques in Chapter 4 of "Digital Processing of Speech Signals" by L. Rabiner and R. Schafer, referred to hereinbefore.

(3) The optimum number of integral waveform periods in samples, NPOPT, is found such that the expression |DEV-NPOPT| is minimized. (See block 100, FIG. 20(c)). This will be taken as the ideal number of samples that should be skipped (i.e. edited out). (Note: if DEV<0, NPOPT will also be a negative number, indicating the optimum number of periods that should be repeated).

(4) Find the zerocrossing point nearest to LESIOW and mark this point as ZCR1 as shown at (d) in FIG. 19 and block 101, FIG. 20(c).

(5) From this point, search either "side" of the sample located at (NPOPT+ZCR1) in the temporarily loaded waveform for the zerocrossing which matches the direction of that found at ZCR1. The point at which this second zerocrossing is found is marked as ZCR2. In the example shown, this point is found at a sample approximately one pitch period away from ZCR1 (Block 102, FIG. 20(c)).

(6) The segment comprising LENFRM samples following ZCR2 (i.e. from ZCR2+1 to y') is transferred in the output buffer such that it starts at the sample at ZCR1+1 (thus overwriting the temporary data) as shown at (e) in FIG. 19 and block 103, FIG. 20(c). This completes the pitch synchronous editing operation needed.

The sample number at y' is then taken as the current LESIOW and the corresponding sample, y, in the input signal is taken as the current LESIIW for that frame (see block 104, FIG. 20(c)). Following the load just described, the next load tested in the example will reveal that |DEV|<ETIS, and consequently the next LENFRM samples following y in the input waveform (i.e. to z) can be loaded into the output buffer following y' (i.e. to z') with no editing, as shown at (a) and (e) in FIG. 19 respectively.

The preceding procedure also succeeds if DEV<0, if NPOPT is allowed to take on negative values, thereby indicating that the search for ZCR2 will be made around the sample (ZCR1+NPOPT) (i.e. to the left of ZCR1) for a segment which will start at ZCR2 and be repeated after the sample at ZCR1.
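Steps (2) and (3) of the procedure can be sketched in Python as follows. This is a minimal sketch, not the implementation used in the apparatus: the AMDF is evaluated by brute force over an assumed lag range, the lag with the smallest average magnitude difference is taken as PERIOD, and NPOPT is obtained by rounding DEV to the nearest whole number of periods. The function names and the lag limits are hypothetical, and the zerocrossing matching of steps (4) to (6) is not covered.

```python
from typing import Sequence


def amdf_period(x: Sequence[float], min_lag: int, max_lag: int) -> int:
    """Return the lag in [min_lag, max_lag] minimizing the average magnitude difference."""
    best_lag, best_val = min_lag, float("inf")
    for lag in range(min_lag, max_lag + 1):
        n = len(x) - lag                       # number of sample pairs compared
        val = sum(abs(x[i] - x[i + lag]) for i in range(n)) / n
        if val < best_val:                     # strict test keeps the shortest lag on ties
            best_lag, best_val = lag, val
    return best_lag


def optimum_skip(dev: int, period: int) -> int:
    """NPOPT: the whole number of periods (in samples) closest to DEV; negative means repeat."""
    return round(dev / period) * period


# A synthetic waveform with a period of 8 samples.
wave = [0, 3, 5, 3, 0, -3, -5, -3] * 8
period = amdf_period(wave, min_lag=4, max_lag=20)
print(period, optimum_skip(100, period), optimum_skip(-100, period))
```

For an unvoiced segment no reliable period exists, and the text instead sets NPOPT equal to DEV directly.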

The process of testing DEV each frame continues for the entire time warping path. However, special action must be taken when the measurement of signal period reveals that the segment under scrutiny is unvoiced. (See decision 105, FIG. 20(c)). When this situation occurs, NPOPT, the number of samples to be skipped (or repeated) is set to equal DEV, and then the procedure described above is followed from step (4). Lastly, a further operational difference takes place when the segment to be output is classified as silence. In this case, because digital silence (i.e. a frame of zeros) is used to replace the input waveform, LESIIW may be incremented by the difference in samples between the previous TESIIW and the current TESIIW, thus keeping the deviation constant. This is shown at blocks 106 and 107 of FIG. 20(c) which follows decisions 108 and 109 of FIG. 20(b).

A flow diagram of the entire editing process is given in FIG. 20. A feature included (not previously mentioned) is a "look ahead" test in which if the deviation calculated for a frame indicates that an edit is required, decision 110 of FIG. 20(b), the deviation for the next frame is calculated and, if the deviation in the next frame (with no editing being done in the current frame) is within the edit threshold, decision 97, then no editing action will take place in the current frame.

Several simple modifications can be made to the preceding basic operations which reduce the chances of discontinuities at speech-to-silence and silence-to-speech frame boundaries. For example if a speech frame j precedes a frame j+1 classified as silence, then the actual signal content of the frame j+1 can be output in place of digital silence and a scan backwards through the waveform in frame j+1 can be made to locate the first zero crossing location. Then all points from this location to the end of the frame j+1 can be set to digital zero. Alternatively, a simple linear crossfade to zero can be introduced at the end of frame j (or, if used, j+1). Similarly, if silence is followed by speech at frame j, frame j-1 can be output in place of silence, and a zeroing of the waveform from the beginning of frame (j-1) to the first zerocrossing (or a linear crossfade) may again be carried out.

Although in the preceding description an output waveform is produced on a frame-by-frame basis according to the results of computing the deviation DEV at each frame, it is also possible to build up a table of pointers to samples in the input waveform from the editing data, and these pointers may be saved in system memory or on disc. The pointers can be used to indicate the start and end samples of segments to be fetched during a playback operation and also indicate the position and duration of segments of digital silence to be output. Thus a list of editing instructions is produced rather than a new waveform, and considerable disc space may be saved with no operational disadvantages.

The processing operations as described hereinbefore with reference toFIG. 6 are coordinated and/or are carried out using software whichoperates in the hardware shown in FIG. 2 as follows.

The separate procedures for Operator Interfacing, System Control, andSignal Editing are originally written in RATFOR (Rational FORTRAN)language, and are translated by a RATFOR preprocessor to produce ANSIIFORTRAN-77 code. This source code is compiled by the Intel FORTRAN-86compiler to produce individual program units in the form of relocateableobject code. These program units, together with appropriate devicedrivers, input/output system, and operating system nucleus are thenconfigured into a loadable system of tasks using the Intel RMX-88Interactive Configuaration Utility. This system contains the appropriatesoftware to support a Real-Time Multitasking Environment in which theapplication tasks and operating system can run and it may be loaded intothe Random Access Memory (RAM) on SBC1 (from a disc file for example)and executed. When running, the task priorities are arranged so that theoperator communication, Magnatech signal sensing and control, signaldigitization and storage on disc, signal editing, and communication withSBC2 all appear to take place concurrently.

More specifically, these procedures are handled either by Interrupt Service Routines (ISR), which respond immediately to real-time events, such as a signal received on an interrupt line, and thus quickly service specific external events; or by Interrupt Tasks which exchange the activities of the processor for more complex sets of responses. The processes on SBC1 start upon receipt of the Master Record (On) signal from the MTE 152 processor 15 and are thus grouped together in an interrupt task. Amongst the start up procedures are: start the time warping processor via a memory-mapped flag between SBC1 and SBC2, enable the A/D-C buffer hardware interrupt, enable the termination procedure on Master Record (Off), and start the Editing Processor. The Editing Processor (also on SBC1) runs as part of the same task, but examines pointers on SBC2 via memory mapping to ascertain if data is available for processing, and also sets pointers in the memory of SBC2 to stop unprocessed data being overwritten.
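The pointer exchange between the Editing Processor and SBC2 can be pictured with the following simplified sketch. It models the shared memory as a circular buffer with one count advanced by the producer and one by the consumer; the class name, the buffer organisation and the method names are assumptions made for illustration, and the actual memory layout used on SBC1 and SBC2 is not specified here.

```python
class SharedPathBuffer:
    """Toy model of a shared buffer guarded by two pointers.

    write_count stands in for a pointer advanced by the producer (the time
    warping processor); read_count stands in for a pointer set by the
    consumer (the editing processor) to mark data it has finished with.
    """

    def __init__(self, size):
        self.slots = [None] * size
        self.write_count = 0
        self.read_count = 0

    def available(self):
        # Consumer side: how many unprocessed items are waiting.
        return self.write_count - self.read_count

    def write(self, value):
        # Producer side: refuse to overwrite data not yet consumed.
        if self.available() >= len(self.slots):
            return False
        self.slots[self.write_count % len(self.slots)] = value
        self.write_count += 1
        return True

    def read(self):
        # Consumer side: take the oldest item and release its slot.
        if self.available() == 0:
            return None
        value = self.slots[self.read_count % len(self.slots)]
        self.read_count += 1
        return value
```

In this model a full buffer makes `write` return False until the consumer has advanced its pointer, which corresponds to stopping unprocessed data from being overwritten.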

The transfer of data from the A/D-C buffer memory to the disc is handled by an Interrupt Task which responds to the A/D-C Buffer Full hardware interrupt signal and passes the appropriate memory and disc addresses to the disc controller which in turn activates and supervises the data transfer by means of Direct Memory Access without further processor intervention.

The termination procedure is initiated on deactivation of the Master Record signal, and again memory-mapped pointers and i/o port handshakes support interboard communication during this stage.

The Time Warping Processor (TWP) on SBC2 is written in RATFOR, preprocessed, compiled and configured into a simpler, single task module, loadable from disc into the RAM on SBC2. Once the task on this board has been started, it waits to receive an interrupt from SBC1 via an i/o port to start the TWP. After the TWP has begun, the Parameter Buffer Full hardware interrupt is enabled, and emptying these buffers into the on-board memory of SBC2 is done via an ISR. The time warping path is passed to SBC1 via the memory mapping as explained above, and the TWP termination signals are passed via i/o interrupts and memory-mapped flags.

FIG. 21 is a block diagram in more detail of the analog-to-digital and digital-to-analog units 28 and 29 of FIGS. 2 and 3, and reference numerals used in FIG. 3 are applied in FIG. 21 to their corresponding elements. FIG. 21 shows the control 32 of FIG. 3 to include a clock generator 111, which runs at 12.288 megahertz. The units 28 and 29 also include a loop and mute logic which allows the digitized signal from the microphone 11 to be routed to the digital-to-analog unit 29 if required. The coupling of the microphone input to the dub parameter extraction processor 42 of FIG. 2 is also indicated in FIG. 21, the microphone input passing through a channel designated CHANNEL A AUDIO in FIG. 21 to a filterbank (not shown) in the form of an MS2003 digital filter and detector (FAD) manufactured by The Plessey Company p.l.c. of England under licence from British Telecom and described in Plessey Data Sheet Publication No. P.S. 2246 issued by Plessey Research (Caswell) Limited, Allen Clark Research Centre, Caswell, Towcester, Northants, England. The CHANNEL B AUDIO indicated in FIG. 21 is the channel to the guide track parameter extraction processor 43 of FIGS. 2 and 4. A second MS2003 digital filter and detector, FAD2, constitutes the digital filterbank 57 shown in FIG. 4. Channels A and B have respective buffers as final stages shown in FIG. 21, and the outputs from these buffers are differential, this being indicated by double lines from the buffer stages, as in the case of the audio output buffer 41. Interconnections in the control circuitry and from elements of the control circuitry to the controlled units are simple or complex buses. The large buffer 30 of FIG. 3 is arranged as two memory banks A and B having common data and address multiplexers.

In each of the parameter extraction processors 42 and 43, the processes carried out by each block inscribed LOG are, in this example, the addressing and outputting from a look-up table in a PROM (programmable read-only memory). The switches 58 may be a multiplexer.
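By way of illustration only, a logarithmic look-up table of the kind a PROM might hold can be generated as in the sketch below. The 8-bit input and output widths, the base-2 logarithm, the scaling to the full output range, and the entry of zero stored at address 0 are all assumptions of this sketch; the actual table contents used in the processors 42 and 43 are not given here.

```python
import math


def build_log_table(input_bits=8, output_bits=8):
    """Build a table mapping each input code to a scaled, rounded log2.

    Address 0 has no defined logarithm, so this sketch stores 0 there
    (an assumption); the largest input code maps to the largest output code.
    """
    size = 1 << input_bits
    max_code = (1 << output_bits) - 1
    scale = max_code / math.log2(size - 1)
    table = [0] * size
    for x in range(1, size):
        table[x] = round(scale * math.log2(x))
    return table


LOG_TABLE = build_log_table()


def log_lookup(x):
    """The LOG operation then reduces to addressing the table with the input."""
    return LOG_TABLE[x]
```

Replacing the computation by a table access in this way trades a small amount of memory for a fixed, very short conversion time per sample.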

Further accounts of prior art time warping and word recognition are given by L. R. Rabiner and S. E. Levinson in an article entitled "Isolated and Connected Word Recognition--Theory and Selected Applications" at pages 621 to 659 of the IEEE Transactions on Communications, Vol. COM-29, No. 5, May 1981.

We claim:
 1. A method for use in editing speech, the method being characterised by the following steps: producing digital data representative of a second speech signal which is substantially imitative of a first speech signal; processing the said signals to determine therefrom the occurrence and/or value of selected time-varying parameters of the first and second signals; generating digital data representative of presence and absence of speech in the second signal, in response to processed digital data representative of the occurrence and/or value of selected time-varying parameters in the second signal; generating digital data representative of pitch in the second signal; utilizing the sequences of digital data representative of presence and absence of speech in the second signal and representative of time-varying parameters of the first and second speech signals to generate digital data representative of difference between the timing of characteristic features of the second speech signal and the timing of the corresponding characteristic features of the first speech signal; and processing the digital data representative of pitch and the said difference in timing and the sequence of digital data representative of presence and absence of speech in the second speech signal and the said digital data corresponding to the second speech signal so as to generate editing data in accordance with a requirement to substantially replicate with the said characteristic features of the second speech signal the timing of the corresponding characteristic features of the first speech signal by adjusting the durations of silence and/or speech in the second speech signal.
 2. A method according to claim 1, characterised by the step of editing the digital data corresponding to the second speech signal in accordance with the editing data and generating thereby edited digital data corresponding to an edited version of the second speech signal.
 3. A digital audio system characterised by means (25) for storing digital data corresponding to a second speech signal which is substantially imitative of a first speech signal; means (42,43) for determining from the first and second speech signals the occurrence and/or value of selected time-varying parameters of the first and second signals; means (SBC1, 48) for generating digital data encoding characteristic acoustic classifications in response to processed digital data representative of the occurrence and/or value of selected time-varying parameters of the second signal; means (SBC1, 50) for generating the digital data representative of pitch in the second signal; means (SBC2) for utilizing the sequences of digital data encoding the said characteristic classifications and representative of time-varying parameters of the first and second speech signals to generate digital data representative of difference between the timing of characteristic features of the second speech signal and the timing of the corresponding characteristic features of the first speech signal; means (SBC1) for processing the digital data representative of pitch and the said difference in timing and the sequence of digital data encoding characteristic classifications of the second speech signal and the said digital data corresponding to the second speech signal so as to generate editing data in accordance with a requirement to substantially replicate with the features of the second speech signal the timing of the corresponding characteristic features of the first speech signal by adjusting the durations of silence and/or speech in the second speech signal.
 4. A digital audio system according to claim 3, characterised in that means (SBC1, 51) are provided for editing the digital data corresponding to the second signal in accordance with the editing data and generating thereby edited digital data corresponding to an edited version of the second speech signal.
 5. A method of processing a replacement for an unsatisfactory recorded reference signal x₁ (t) containing a signal of interest s₁ (t) with significant time-dependent features, characterised in that a replacement signal x₂ (t') that contains a signal of interest s₂ (t') with substantially the same sequence of time-dependent features as s₁ (t) but whose features occur with only roughly the same timing as the corresponding features of s₁ (t) is provided; selected physical aspects of the signals x₁ (t) and x₂ (t') are periodically measured and from these measurements values of time-dependent parameters are determined, the measurements being carried out at a sufficiently high rate for significant changes in the characteristics of the signals x₁ (t) and x₂ (t') to be detected; successive segments of the replacement signal are classified from the sequence of some or all of the parameters so as to produce time-dependent classifications referring to presence and absence of a signal of interest s₂ (t') over the measurement period; the time-dependent classifications and the time-dependent parameters of the signal x₁ (t) and x₂ (t') are utilized to produce a function that describes the distortion of the time scale of the replacement signal x₂ (t'), that must take place to give the best alignment in time of the time-dependent parameters of the replacement signal with the corresponding time-dependent parameters of the reference signal; the time scale distortion function is analysed to detect the presence of sufficient discrepancies between the reference and replacement signals' timing to warrant alterations being made to the time waveform of the replacement signal to achieve the desired alignment of significant features occurring on the time scale of the replacement signal with the corresponding significant features on the time scale of the reference signal; the information obtained from this analysis of the time-scale distortion is utilized with information on the time-dependent classifications of and waveform of, and possibly fundamental frequency data of, the replacement signal to generate detailed control information for an editing process which is to operate on the replacement signal.
 6. A method according to claim 5, characterised in that the said control information is used in the editing process to determine the deletion and/or insertion of appropriate sequences of signal data from or into the replacement signal so as to substantially replicate the timing of the significant relative time-dependent features of the reference signal in the edited signal.
 7. A method of processing signals, the method being characterised by the steps of: producing first signal feature data related to selected time-dependent features of a first signal and second signal feature data related to the same time-dependent features of a second signal which substantially resembles the first signal; utilizing the said first and second signal feature data so as to produce timing difference data representative of difference between the timing of features of the second signal and the timing of corresponding features of the first signal; producing second signal waveform data from which the waveform of the second signal can be reproduced; and utilizing the timing difference data to generate editing data determining which portions of the second signal waveform data are to be deleted and/or repeated in order to produce from the second signal waveform data further data from which there can be produced a waveform which substantially replicates the relative timing of the said features of the first signal.
 8. A method according to claim 7, characterised by the steps of deleting and/or repeating portions of the second signal waveform data, in accordance with the editing data.
 9. Signal processing apparatus characterised by means for producing respectively from a first signal and a second signal first signal feature data and second signal feature data related to selected time-dependent features of the said signals; means for utilizing the said first and second signal feature data so as to produce timing difference data representative of difference between the timing of the said features of the second signal and the timing of substantially the same features of the first signal; means producing second signal waveform data from which the waveform of the second signal can be reproduced; and means for utilizing the timing difference data so as to generate editing data determining which portions of the second signal waveform data are to be deleted and/or repeated in order to produce from the second signal waveform data further data from which there can be produced a waveform which substantially replicates the relative timing of the said features of the first signal.
 10. Signal processing apparatus according to claim 9, characterised by means provided for effecting such deleting and/or repeating of portions of the second signal waveform data in accordance with the editing data.